Stream Clustering: Efficient Kernel-based Approximation using Importance Sampling

Radha Chitta, Rong Jin and Anil K. Jain
Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
chittara@msu.edu, rongjin@cse.msu.edu, jain@cse.msu.edu

Abstract: Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e. they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do not achieve acceptable cluster quality on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better cluster quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream in real time based on each data point’s similarity to the sampled data points. Every time a data point is sampled, the kernel matrix representing the similarity between the sampled points is updated, and projected to a low-dimensional space spanned by its top eigenvectors. The data points are clustered in this low-dimensional space using the k-means algorithm. Thus, the Approximate Stream Kernel k-means algorithm performs clustering in linear time using kernel-based similarity. We show that only a small number of points need to be sampled from the data stream, and the resulting approximation error is well-bounded.
Using several large benchmark data sets to simulate data streams, we demonstrate that the proposed algorithm achieves a significant speedup over other kernel-based clustering algorithms with minimal loss in cluster quality. We also demonstrate the practical applicability of the proposed algorithm by using it to efficiently find trending topics in the Twitter stream data.

Keywords: Stream clustering, Kernel clustering, k-means, Twitter stream

I. INTRODUCTION

In many applications related to stock trading, social networks, and communication networks, large amounts of data are generated continuously at an extremely rapid rate. For example, about 1 terabyte of trade information is generated during each trading session in the New York Stock Exchange. Over 100,000 tweets are published every 60 seconds by millions of users on Twitter¹. This data needs to be analyzed in real time to gain useful insights and make important decisions.

¹ http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf

Clustering is an important exploratory technique for grouping and learning about data. Many efficient algorithms have been developed for clustering large data sets [1], [2], [3]. However, stream data introduces some additional challenges to clustering:

(i) As the data is generated continuously and may be unbounded, it is not possible to store all the data in memory. Each data point can be accessed at most once.
(ii) The data is non-stationary, i.e. the characteristics of the stream data can change over time. This requires the cluster model (number of clusters, cluster representatives, cluster size and shape, etc.) to evolve dynamically.

Batch clustering algorithms such as k-means and kernel k-means [4] assume that (i) the entire data to be clustered is available at the time of clustering, and (ii) the input data is drawn from a mixture of known distributions.
For these reasons, batch clustering algorithms cannot be directly used to cluster stream data.

Stream clustering algorithms generally consist of two stages: (i) an online phase, where the stream data is summarized into prototypes as it arrives, and (ii) an offline phase, where these prototypes are used to obtain the clusters. The set of prototypes is dynamically updated to account for the evolution of the clusters in the stream data. Many stream clustering algorithms assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. While these “linear” algorithms are efficient, they are not able to identify complex non-linearly separable clusters in real data sets as accurately as kernel-based clustering algorithms, which use non-linear pairwise similarity measures. However, kernel-based clustering algorithms such as kernel k-means and spectral clustering have at least quadratic running time complexity, and are ill-suited to data streams [5]. The kernel-based stream clustering algorithms currently published in the literature have high running time complexity, cannot perform real-time clustering, and usually require the selection of a large number of parameters (e.g. thresholds on the inter-cluster distance [6]), which are difficult to tune.

In this article, we propose a variant of the kernel k-means algorithm, called approximate stream kernel k-means, which samples the data points as they arrive, with probability proportional to their “importance” in the stream, measured in terms of the statistical leverage scores [7]. An approximate kernel matrix is constructed, using the sampled points, and used to find the cluster centers. Clustering is performed in a low-dimensional space spanned by the top eigenvectors of the approximate kernel matrix. The running time complexity of the proposed algorithm is linear in N, the number of data points in the stream.
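The ingredients named above (leverage scores, a kernel matrix over sampled points, and a projection onto its top eigenvectors) can be illustrated with a small batch sketch. This is not the streaming procedure proposed in this article: it assumes a Gaussian (RBF) kernel, computes the full kernel matrix once to obtain leverage scores (which a stream algorithm cannot afford), and uses a Nystrom-style embedding. The names `rbf_kernel`, `leverage_scores`, `embed`, and the parameters `gamma`, `C`, and `m` are hypothetical choices made for illustration only.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of X and the rows of Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def leverage_scores(K, C):
    """Leverage scores of a symmetric kernel matrix w.r.t. its top-C eigenvectors."""
    _, vecs = np.linalg.eigh(K)      # eigenvalues are returned in ascending order
    top = vecs[:, -C:]               # eigenvectors of the C largest eigenvalues
    return (top**2).sum(axis=1)      # squared row norms; they sum to C

def embed(K_new, K_sampled, C):
    """Project points onto the top-C eigenvectors of the sampled kernel matrix.

    K_new holds kernel values between the points to embed (rows)
    and the sampled points (columns).
    """
    vals, vecs = np.linalg.eigh(K_sampled)
    vals, vecs = vals[-C:], vecs[:, -C:]
    return K_new @ vecs / np.sqrt(np.maximum(vals, 1e-12))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))        # toy data standing in for a stream
C, m = 3, 40                         # assumed number of clusters and sample size

K = rbf_kernel(X, X)                 # full kernel: illustration only, O(N^2)
scores = leverage_scores(K, C)
probs = scores / scores.sum()        # sampling probability proportional to importance
idx = rng.choice(len(X), size=m, replace=False, p=probs)

# Low-dimensional representation on which k-means would then be run.
Z = embed(rbf_kernel(X, X[idx]), rbf_kernel(X[idx], X[idx]), C)
```

Each row of `Z` is a C-dimensional representation of one data point, computed only from its kernel similarity to the `m` sampled points; running ordinary k-means on these rows yields the cluster labels.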
We show that only a small subset of points needs to be sampled and stored in memory. As only the sampled points are used to perform clustering, the proposed algorithm is very efficient. We demonstrate empirically using several benchmark data sets that the proposed algorithm can cluster stream data sets at speeds up to 8 MBps with as few as 1,000 sampled data points. Unlike other kernel-based stream clustering algorithms,