A Fast Approximate Kernel k-means Clustering Method for Large Data Sets

T. Hitendra Sarma and P. Viswanath
Department of Computer Science and Engineering
Rajeev Gandhi Memorial College of Eng. and Technology
Nandyal-518501, A.P., India.
Email: {hitendrasarma, viswanath.p}@ieee.org

B. Eswara Reddy
Department of Computer Science and Engineering
JNTUA College of Engineering
Anantapur-515002, A.P., India
Email: eswarcsejntu@gmail.com

Abstract—In unsupervised classification, the kernel k-means clustering method has been shown to perform better than the conventional k-means clustering method at identifying non-isotropic clusters in a data set. The space and time requirements of this method are O(n²), where n is the data set size. This paper proposes a two-stage hybrid approach to speed up the kernel k-means clustering method. In the first stage, the data set is divided into a number of group-lets by employing a fast clustering method called the leaders clustering method. Each group-let is represented by a prototype called its leader. The set of leaders, which depends on a threshold parameter, can be derived in O(n) time. The paper presents a modification to the leaders clustering method in which group-lets are found in the kernel space (not in the input space) but are represented by leaders in the input space. In the second stage, the kernel k-means clustering method is applied to the set of leaders to derive a partition of the set of leaders. Finally, each leader is replaced by its group to obtain a partition of the data set. The proposed method has time complexity O(n + p²), where p is the size of the leaders set; its space complexity is also O(n + p²). The proposed method can be easily implemented. Experimental results show that, with a small loss of quality, the proposed method significantly reduces the running time compared to the conventional kernel k-means clustering method.
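The first stage of the approach summarized above rests on the standard leaders clustering method: a single scan in which each point either joins the group of a nearby leader or becomes a new leader. The sketch below, in Python, shows only this standard input-space variant; the function name `leaders` and parameter `tau` are illustrative, and the paper's actual modification (finding group-lets in the kernel space) is not reproduced here.

```python
import numpy as np

def leaders(X, tau):
    """One-pass leaders clustering (standard input-space variant, a sketch).

    Each point joins the group of the first existing leader within
    distance tau of it; otherwise it becomes a new leader. The scan is
    O(n * p) in the worst case, where p is the number of leaders found,
    so it is linear in n when p stays small.
    """
    leader_idx = []   # indices (into X) of the leaders found so far
    groups = []       # groups[g] = indices of the points led by leader g
    for i, x in enumerate(X):
        for g, l in enumerate(leader_idx):
            if np.linalg.norm(x - X[l]) <= tau:
                groups[g].append(i)   # x falls in an existing group-let
                break
        else:
            leader_idx.append(i)      # x starts a new group-let
            groups.append([i])
    return leader_idx, groups

# Toy run: two nearby points share a leader, the far point is its own leader
X = np.array([[0.0], [0.1], [5.0]])
leader_idx, groups = leaders(X, tau=1.0)
```

Note that the result depends on the scan order and on the threshold tau, which is why the abstract describes the leaders set as depending on a threshold parameter.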
I. INTRODUCTION

Data clustering is the process of identifying the natural groupings that exist in a given data set, such that objects in the same cluster are more similar and objects in different clusters are less similar. It is considered an important tool in various applications like pattern recognition, image processing, data mining, remote sensing, statistics, etc. [1]. Clusters in the given data may be of different types, such as isotropic, non-isotropic, linearly separable, and non-linearly separable. It has been observed that when data sets have isotropic and linearly separable clusters, sum-of-squares based partitioning methods, like the k-means clustering method, are effective. On the other hand, kernel based clustering methods, like the kernel k-means clustering method, have proved effective at identifying clusters which are non-isotropic and linearly inseparable in the input space [2], [3].

Girolami [3] first proposed the kernel k-means clustering method. It is an iterative method: it first maps the data points from the input space to a higher dimensional feature space through a non-linear transformation φ(·) and then minimizes the clustering error in that feature space. The distance between a data point and a cluster center in the feature space can be computed using a kernel function without knowing the explicit form of the transformation [4]. This is because the dot product between two data points x and y in the feature space, which is φ(x) · φ(y), can be computed as a function k(x, y), where k : D × D → R is called the kernel function. This is often known as the kernel trick and is valid for transformations that satisfy Mercer's conditions [5]. Some standard kernel functions are given below.

Polynomial kernel: k(x_i, x_j) = (x_i · x_j + 1)^d
Radial (RBF) kernel: k(x_i, x_j) = exp(−r ||x_i − x_j||²)
Neural kernel: k(x_i, x_j) = tanh(a x_i · x_j + b)

where a, b and d are positive constants.
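The kernel trick described above can be checked numerically. The sketch below, in Python, evaluates the polynomial and RBF kernels from the list above, and for the degree-2 polynomial kernel in 2-D also builds the explicit feature map φ so that φ(x) · φ(y) can be compared against k(x, y). The function names and parameter defaults are illustrative, not part of the paper.

```python
import numpy as np

def poly_kernel(x, y, d=2):
    """Polynomial kernel k(x, y) = (x . y + 1)^d."""
    return (np.dot(x, y) + 1.0) ** d

def rbf_kernel(x, y, r=1.0):
    """Radial (RBF) kernel k(x, y) = exp(-r ||x - y||^2)."""
    return np.exp(-r * np.dot(x - y, x - y))

def phi_poly2(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D:
    phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2),
    chosen so that phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, x2 ** 2, s * x1 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel trick: the feature-space dot product equals the kernel value,
# so the 6-dimensional map phi never has to be formed in practice.
assert np.isclose(np.dot(phi_poly2(x), phi_poly2(y)), poly_kernel(x, y, d=2))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, which is precisely why computing distances through k(x, y) rather than through φ is essential.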
For two arbitrary data points x_i and x_j, the value k(x_i, x_j) is needed very often in the iterative clustering process. So, a kernel matrix K = [k_ij] of size n × n is computed, where the (i, j)-th entry is k_ij = k(x_i, x_j) and n is the data set size. The kernel matrix is precomputed and stored, so the time and space requirements (detailed in later sections) are O(n²). This is the main drawback of the kernel k-means clustering method, and because of it the method is not suitable when the data set size is large. Its other drawbacks are: (i) the number of clusters, k, must be given as input to the method, and (ii) the result is sensitive to the initial seed points, and the solution found may not be optimal because of the local minima problem.

Several improvements have been proposed to address these drawbacks. In order to overcome the local minima problem, Likas et al. proposed the global kernel k-means method [6], which produces a final partition that is independent of the initial seed points. The soft geodesic kernel k-means method [7] improves the quality of the clustering result by taking the internal data manifold structure into account. Further, some semi-supervised clustering algorithms aim to improve the clustering accuracy under the supervision of a limited amount of labeled data. Kernel based approaches, such as the kernel based c-means method [8]¹, the kernel-based fuzzy c-means method, the semi-supervised kernel fuzzy c-means method [9], etc., have been successfully used to deal with classification and clustering

¹Some authors call the k-means clustering method the c-means clustering method.

978-1-4244-9477-4/11/$26.00 ©2011 IEEE
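The O(n²) cost discussed above comes from the kernel matrix itself, and the role K plays inside kernel k-means follows from expanding the squared feature-space distance to a cluster mean: for a cluster C, ||φ(x_i) − (1/|C|)Σ_{j∈C} φ(x_j)||² = K_ii − (2/|C|)Σ_{j∈C} K_ij + (1/|C|²)Σ_{j,l∈C} K_jl, which uses only entries of K. The sketch below, in Python, shows both pieces; the function names are illustrative and this is not the paper's implementation.

```python
import numpy as np

def kernel_matrix(X, kernel):
    """Precompute the n x n kernel matrix K[i, j] = k(x_i, x_j).
    This is the O(n^2) time and space bottleneck discussed in the text."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def dist_to_center_sq(K, i, members):
    """Squared feature-space distance from point i to the mean of the
    cluster whose point indices are in `members`, using kernel values only:
    ||phi(x_i) - m||^2 = K_ii - (2/|C|) sum_j K_ij + (1/|C|^2) sum_{j,l} K_jl.
    """
    m = np.asarray(members)
    c = len(m)
    return K[i, i] - 2.0 * K[i, m].sum() / c + K[np.ix_(m, m)].sum() / (c * c)

# Toy usage with an RBF kernel (r = 1.0 is an illustrative choice):
X = [np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([5.0, 5.0])]
K = kernel_matrix(X, lambda a, b: np.exp(-np.dot(a - b, a - b)))
d = dist_to_center_sq(K, 2, [0, 1])   # point 2 is far from the group {0, 1}
```

Each kernel k-means iteration assigns every point to the cluster minimizing this distance, so precomputing K avoids re-evaluating k(·, ·) repeatedly, at the price of O(n²) storage.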