2017 IEEE International Conference on Big Data (BIGDATA), 978-1-5386-2715-0/17/$31.00 ©2017 IEEE

Sampling Algorithms to Update Truncated SVD

Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra
University of Tennessee, Knoxville, Tennessee, U.S.A.

Abstract—A truncated singular value decomposition (SVD) is a powerful tool for analyzing modern datasets. However, the massive volume and rapidly changing nature of the datasets often make it too expensive to compute the SVD of the whole dataset at once. It is more attractive to use only a part of the dataset at a time and incrementally update the SVD. A randomized algorithm has been shown to be a great alternative to a traditional updating algorithm due to its ability to efficiently filter out noise and extract the relevant features of the dataset. Though it is often faster than the traditional algorithm, in order to extract the relevant features, the randomized algorithm may need to access the data multiple times, and this data access creates a significant performance bottleneck. To improve the performance of the randomized algorithm for updating the SVD, in this paper we study two sampling algorithms that access the data only two or three times, respectively. We present several case studies showing that only a small fraction of the data may be needed to maintain the quality of the updated SVD, while our performance results on a hybrid CPU/GPU computer demonstrate the potential of the sampling algorithms to improve the performance of the randomized algorithm.

Index Terms—sample; randomize; update SVD; out-of-core

I. INTRODUCTION

To analyze modern datasets with a wide variety and veracity, a truncated singular value decomposition (SVD) [1] of the matrix representing the data is a powerful tool.
The ability of the SVD to filter out noise and extract the underlying features of the data has been demonstrated in many data analysis tools, including Latent Semantic Indexing (LSI) [2], recommendation systems [3], population clustering [4], and subspace tracking [5]. Also, as modern datasets are constantly being updated and analyzed, we develop a good understanding of the data (e.g., the singular value distribution), which can be used to tune the performance or the robustness of computing the SVD for that particular application (e.g., the required numerical rank for accurate data analysis, or the number of data passes needed to compute the SVD). Furthermore, these tuning parameters stay roughly the same for different datasets from the same application.

With the increase in external storage capacity, the amount of data generated from observations, experiments, and simulations has been growing at an unprecedented rate. These phenomena have led to the emergence of numerous massive datasets in many areas of study, including science, engineering, medicine, finance, social media, and e-commerce. Specific applications that generate such rapidly-changing massive datasets include communication and electric grids, transportation and financial systems, personalized services on the internet, particle or astrophysics, and genome sequencing. Hence, besides the variety and veracity of the dataset, the data analysis tool must address the challenges associated with the volume and velocity of the changes made to the dataset. For instance, computers may not have enough compute power to accommodate such rapidly growing or changing data if the computational complexity of the data analysis tool grows superlinearly with the data size. In addition, accessing the data through the local memory hierarchy is expensive, and accessing data in external storage is even more costly. Therefore, the data analysis tool needs to be data-pass efficient.
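As a concrete point of reference for the object being maintained throughout this paper, the following is an illustrative NumPy sketch (not the paper's implementation) of a rank-k truncated SVD, which keeps only the k largest singular triplets of the data matrix:

```python
# Illustrative sketch: rank-k truncated SVD via a full SVD (NumPy).
# For the massive datasets discussed in the text, computing the full SVD
# like this is exactly what becomes too expensive; it only defines the target.
import numpy as np

def truncated_svd(A, k):
    """Return U_k, s_k, Vt_k such that A ~= U_k @ diag(s_k) @ Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 40))
U, s, Vt = truncated_svd(A, 5)
A5 = U @ np.diag(s) @ Vt  # best rank-5 approximation of A in the 2-norm
```

By the Eckart-Young theorem, the 2-norm error of this rank-k approximation equals the (k+1)-st singular value, which is why a rapidly decaying singular value distribution lets a small k capture the relevant features.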
In particular, it may become too costly to compute the SVD of the whole dataset at once, or to recompute the SVD every time changes are made to the dataset. In some applications, recomputing the SVD may not even be possible because the original data, for which the SVD has already been computed, is no longer available. To address these challenges, an attractive approach is to update (rather than recompute) the SVD. For example, we can incrementally update the SVD using only the part of the matrix that fits in the core memory at a time. Hence, the whole matrix is moved to the core memory only once.

A randomized algorithm has been shown to be an efficient method to update the SVD [6]. To reduce both the computational and data access costs, it projects the data onto a smaller subspace before computing the updated SVD. Compared with the state-of-the-art updating algorithm [7], the randomized algorithm often compresses the data into a smaller projection subspace with a lower communication latency cost. As a result, the randomized algorithm can obtain much higher performance on a modern computer, where communication has become significantly more expensive than arithmetic operations, both in terms of time and energy consumption. In addition, the randomized algorithm accesses the data only through dense or sparse matrix-matrix multiplications (GEMM or SpMM), whose highly-optimized implementations are provided by many vendors. In other applications, the external storage (e.g., a database) may provide the functionality to compute the matrix multiplication and transfer only the resulting vectors to the memory, thus avoiding the explicit generation and transfer of the matrix into the memory. To filter out noise and extract the relevant features, however, the randomized algorithm may require multiple data passes, which become the performance bottleneck. In this paper, we use two methods to reduce this bottleneck.
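The structure of the baseline randomized algorithm can be sketched as follows (a NumPy illustration in the spirit of randomized SVD methods such as [6], with hypothetical parameter names; it is not the paper's updating variant). Note that the data matrix A is touched only through GEMMs, and that each power iteration, which sharpens the extracted features, costs two additional data passes:

```python
# Illustrative sketch of a basic randomized truncated SVD.
# A is accessed only via matrix-matrix products (GEMM/SpMM);
# `oversample` and `q` are hypothetical tuning-parameter names.
import numpy as np

def randomized_svd(A, k, oversample=10, q=1, rng=None):
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega                       # data pass 1 (GEMM)
    for _ in range(q):                  # each iteration: 2 more data passes
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)              # orthonormal basis for the sample range
    B = Q.T @ A                         # final data pass: project to small subspace
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]
```

The small (k + oversample)-dimensional SVD of B is cheap; the cost is dominated by the 2 + 2q passes over A, which is precisely the bottleneck the sampling algorithms below aim to reduce.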
1) We integrate data sampling into the randomized algorithm. Namely, we first sample the new data using the information gathered while compressing the previous data. Then, the randomized algorithm only uses the sampled data (which fits in the memory) to update the SVD. We present two sampling algorithms,