2017 IEEE International Conference on Big Data (BIGDATA)
Sampling Algorithms to Update Truncated SVD
Ichitaro Yamazaki, Stanimire Tomov, and Jack Dongarra
University of Tennessee, Knoxville, Tennessee, U.S.A.
Abstract—
A truncated singular value decomposition (SVD) is a powerful
tool for analyzing modern datasets. However, the massive volume
and rapidly changing nature of the datasets often make it too
expensive to compute the SVD of the whole dataset at once. It
is more attractive to use only a part of the dataset at a time
and incrementally update the SVD. A randomized algorithm has
been shown to be a great alternative to a traditional updating
algorithm due to its ability to efficiently filter out the noise
and extract the relevant features of the dataset. Though it is
often faster than the traditional algorithm, in order to extract
the relevant features, the randomized algorithm may need to
access the data multiple times, and this repeated data access creates a
significant performance bottleneck. To improve the performance
of the randomized algorithm for updating SVD, we study, in this
paper, two sampling algorithms that access the data only two or
three times, respectively. We present several case studies to show
that only a small fraction of the data may be needed to maintain
the quality of the updated SVD, while our performance results
on a hybrid CPU/GPU computer demonstrate the potential of
the sampling algorithms to improve the performance of the
randomized algorithm.
Index Terms—sample; randomize; update SVD; out-of-core
I. INTRODUCTION
A truncated singular value decomposition (SVD) [1] of the
matrix representing the data is a powerful tool for analyzing
modern datasets with wide variety and veracity. The ability
of the SVD to filter out noise and extract the underlying
features of the data has been demonstrated in many data
analysis tools, including Latent Semantic Indexing (LSI) [2],
recommendation systems [3], population clustering [4], and
subspace tracking [5]. Moreover, as modern datasets are
constantly updated and analyzed, we develop a good
understanding of the data (e.g., the singular value distribution),
which can be used to tune the performance or the robustness
of computing the SVD for that particular application (e.g.,
the required numerical rank for the accurate data analysis,
or the number of data passes needed to compute the SVD).
Furthermore, these tuning parameters stay roughly the same
for different datasets from the same application.
With the increase in the external storage capacity, the
amount of data generated from the observations, experiments,
and simulations has been growing at an unprecedented rate.
These phenomena have led to the emergence of numerous
massive datasets in many areas of studies including science,
engineering, medicine, finance, social media, and e-commerce.
The specific applications that generate the rapidly-changing
massive datasets include the communication and electric grids,
transportation and financial systems, personalized services on
the internet, particle physics or astrophysics, and genome sequencing.
Hence, besides the variety and veracity of the dataset, the data
analysis tool must address the challenges associated with the
volume and velocity of the changes made to the dataset. For
instance, a computer may not have enough compute power
to accommodate such rapidly growing or changing data if
the computational complexity of the data analysis tool grows
superlinearly with the data size. In addition, accessing the
data through the local memory hierarchy is expensive, and
accessing these data in the external storage is even more costly.
Therefore, the data analysis tool needs to be data-pass efficient.
In particular, it may become too costly to compute the SVD
of the whole dataset at once, or to recompute the SVD every
time changes are made to the dataset. In some applications,
recomputing the SVD may not even be possible because the
original data, for which the SVD has been already computed, is
no longer available. To address these challenges, an attractive
approach is to update (rather than recompute) the SVD. For
example, we can incrementally update the SVD using only the
part of the matrix that fits in the core memory at a time. Hence,
the whole matrix is moved to the core memory only once.
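To make the incremental update concrete, the sketch below shows one common way (a Zha–Simon-style construction, not necessarily the exact algorithm of [7]) to append a block of new columns E to a matrix whose rank-k truncated SVD is already known; the function name and interface are our own assumptions for illustration:

```python
import numpy as np

def update_truncated_svd(U, s, V, E, k):
    """Given a rank-k truncated SVD A ~ U @ diag(s) @ V.T (U: m-by-k,
    s: k, V: n-by-k), update it after appending new columns E (m-by-p).
    Minimal Zha-Simon-style sketch; only the small augmented matrix
    is decomposed, not the full data matrix."""
    # Project the new columns onto the current left singular subspace
    UtE = U.T @ E
    # Residual of E orthogonal to span(U), compressed by a thin QR
    Q, R = np.linalg.qr(E - U @ UtE)
    p = E.shape[1]
    # Small (k+p)-by-(k+p) augmented matrix mixing old singular values
    # with the projected and residual parts of the new columns
    M = np.block([
        [np.diag(s), UtE],
        [np.zeros((p, s.size)), R],
    ])
    Us, ss, Vst = np.linalg.svd(M, full_matrices=False)
    # Rotate the enlarged bases by the small SVD and truncate to rank k
    U_new = np.hstack([U, Q]) @ Us[:, :k]
    V_pad = np.block([
        [V, np.zeros((V.shape[0], p))],
        [np.zeros((p, V.shape[1])), np.eye(p)],
    ])
    V_new = V_pad @ Vst[:k, :].T
    return U_new, ss[:k], V_new
```

When the new columns lie (numerically) in a low-dimensional subspace already captured by the factorization, the rank-k update reproduces the augmented matrix; in general the truncation discards the smallest singular values of the augmented matrix.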
A randomized algorithm has been shown to be an efficient
method for updating the SVD [6]. To reduce both the computational
and data access costs, it projects the data onto a smaller
subspace before computing the updated SVD. Compared with
the state-of-the-art updating algorithm [7], the randomized
algorithm often compresses the data into a smaller projection
subspace with a lower communication latency cost. As a
result, the randomized algorithm could obtain much higher
performance on a modern computer, where the communica-
tion has become significantly more expensive compared with
the arithmetic operations, both in terms of time and energy
consumption. In addition, the randomized algorithm accesses
the data only through the dense or sparse matrix-matrix
multiplication (GEMM or SpMM) whose highly-optimized
implementations are provided by many vendors. In some
applications, the external storage (e.g., a database) may provide
a functionality to compute the matrix multiplication and only
transfer the resulting vectors to the memory, thus avoiding the
explicit generation and transfer of the matrix into the memory.
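As a minimal illustration of this data-access pattern (a generic randomized range finder in the style of Halko et al., not the exact algorithm of [6]; names and defaults are our own assumptions), note that the matrix A is touched only through GEMM-like products:

```python
import numpy as np

def randomized_truncated_svd(A, k, oversample=5, n_power=1, seed=0):
    """Sketch of a randomized rank-k truncated SVD. A is accessed only
    via matrix-matrix products (GEMM/SpMM); each power iteration costs
    two extra passes over A but sharpens the captured subspace."""
    m, n = A.shape
    ell = k + oversample  # slightly oversampled sketch size
    rng = np.random.default_rng(seed)
    # First data pass: compress the range of A with a Gaussian test matrix
    Y = A @ rng.standard_normal((n, ell))
    for _ in range(n_power):
        # Power iteration: two more passes over A, re-orthonormalizing
        # the sketch for numerical stability
        Y = A @ (A.T @ np.linalg.qr(Y)[0])
    Q, _ = np.linalg.qr(Y)          # orthonormal basis for the sketch
    B = Q.T @ A                     # final pass: project A onto the basis
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)  # small dense SVD
    return Q @ Ub[:, :k], s[:k], Vt[:k, :]
```

All the large operations are products with A or its transpose, so an external store that can apply A to a block of vectors suffices; only the thin sketch matrices ever need to reside in memory.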
To filter out the noise and extract the relevant features,
however, the randomized algorithm may require multiple data
passes that become the performance bottleneck. In this paper,
we use two methods to reduce this bottleneck. 1) We integrate
data sampling into the randomized algorithm. Namely, we first
sample the new data using the information gathered while
compressing the previous data. Then, the randomized algo-
rithm only uses the sampled data (which fits in the memory)
to update the SVD. We present two sampling algorithms,