A Fast Approximate Kernel k-means Clustering
Method for Large Data Sets
T. Hitendra Sarma and P. Viswanath
Department of Computer Science and Engineering
Rajeev Gandhi Memorial College of Eng. and Technology
Nandyal-518501, A.P., India.
Email: {hitendrasarma, viswanath.p}@ieee.org
B. Eswara Reddy
Department of Computer Science and Engineering
JNTUA College of Engineering
Anantapur-515002, A.P., India
Email: eswarcsejntu@gmail.com
Abstract—In unsupervised classification, the kernel k-means clustering method has been shown to perform better than the conventional k-means clustering method in identifying non-isotropic clusters in a data set. The space and time requirements of this method are O(n²), where n is the data set size. The
paper proposes a two-stage hybrid approach to speed up the kernel k-means clustering method. In the first stage, the data set is divided into a number of group-lets by employing a
fast clustering method called leaders clustering method. Each
group-let is represented by a prototype called its leader. The
set of leaders, which depends on a threshold parameter, can
be derived in O(n) time. The paper presents a modification to
the leaders clustering method where group-lets are found in the
kernel space (not in the input space), but are represented by
leaders in the input space. In the second stage, the kernel k-means clustering method is applied to the set of leaders to derive a
partition of the set of leaders. Finally, each leader is replaced by
its group to get a partition of the data set. The proposed method
has a time complexity of O(n + p²), where p is the leaders set size. Its space complexity is also O(n + p²). The proposed method can
be easily implemented. Experimental results show that, with a small loss of quality, the proposed method runs significantly faster than the conventional kernel k-means clustering method.
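To make the first stage of the abstract concrete, the following is a minimal sketch of a one-pass, input-space leaders clustering step (the paper's actual variant finds group-lets in the kernel space; the function name, threshold parameter `tau`, and data layout here are illustrative assumptions, not the authors' implementation):

```python
import math

def leaders(data, tau):
    # One-pass leaders clustering: each point joins the group of the
    # first leader lying within distance tau of it; otherwise the
    # point itself becomes a new leader. A single scan over the data
    # gives the (roughly) linear-time behavior described in the text.
    groups = {}        # leader index -> indices of its group-let members
    leader_ids = []    # indices of the points chosen as leaders
    for i, x in enumerate(data):
        for l in leader_ids:
            if math.dist(x, data[l]) <= tau:
                groups[l].append(i)
                break
        else:
            leader_ids.append(i)
            groups[i] = [i]
    return groups
```

The number of leaders p (and hence the cost of the second stage) is controlled by the threshold tau: a larger tau yields fewer, coarser group-lets.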
I. INTRODUCTION
Data clustering is the process of identifying the natural groupings that exist in a given data set, such that objects in the same cluster are more similar to each other and objects in different clusters are less similar. It is considered an important tool in various applications such as pattern recognition, image processing, data mining, remote sensing, statistics, etc. [1].
Clusters in a given data set may be of different types: isotropic, non-isotropic, linearly separable, non-linearly separable, etc. It has been observed that, when data sets have isotropic and linearly separable clusters, sum-of-squares based partitioning methods, like the k-means clustering method, are effective. On the other hand, kernel based clustering methods, like the kernel k-means clustering method, have proved effective in identifying clusters which are non-isotropic and linearly inseparable in the input space [2], [3].
Girolami [3] first proposed the kernel k-means clustering method. It is an iterative method. It first maps the data points from the input space to a higher dimensional feature space through a non-linear transformation φ(·) and then minimizes the clustering error in that feature space. The distance between a data point and a cluster center in the feature space can be computed using a kernel function without knowing the explicit form of the transformation [4]. This is because the dot product between two data points x and y in the feature space, which is φ(x) · φ(y), can be computed as a function k(x, y), where k : D × D → R is called the kernel function. This is often known as the kernel trick and is valid for transformations that satisfy Mercer's conditions [5]. Some standard kernel functions are given below.
• Polynomial kernel: k(x_i, x_j) = (x_i · x_j + 1)^d,
• Radial basis function (RBF) kernel: k(x_i, x_j) = exp(−r ||x_i − x_j||²),
• Neural kernel: k(x_i, x_j) = tanh(a x_i · x_j + b),
where a, b, d and r are positive constants.
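The three standard kernels listed above translate directly into code; the following is a minimal pure-Python sketch (parameter names d, r, a and b match the formulas, and the default values are arbitrary assumptions for illustration):

```python
import math

def dot(x, y):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(xi, xj, d=2):
    # Polynomial kernel: (xi . xj + 1)^d
    return (dot(xi, xj) + 1.0) ** d

def rbf_kernel(xi, xj, r=1.0):
    # RBF kernel: exp(-r * ||xi - xj||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-r * sq_dist)

def neural_kernel(xi, xj, a=1.0, b=1.0):
    # Neural (sigmoid) kernel: tanh(a * (xi . xj) + b)
    return math.tanh(a * dot(xi, xj) + b)
```

Each function returns the feature-space dot product φ(x_i) · φ(x_j) without ever forming φ explicitly, which is precisely the kernel trick described above.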
For two arbitrary data points x_i and x_j, very often, in the iterative process of clustering, k(x_i, x_j) is needed. So, a matrix called the kernel matrix, K = [k_ij]_{n×n}, is found, where the (i, j)-th entry is k_ij = k(x_i, x_j). Here n is the data set size. The
kernel matrix is precomputed and stored. So, the time and space requirements (which are given in detail in later sections) are O(n²). This is the drawback of the kernel k-means clustering method, because of which it is not suitable when the data set size is large. Its other drawbacks are: (i) the number of clusters, k, should be given as input to the method, and (ii) the result is sensitive to the initial seed points, and the solution found may not be optimal because of the local minima problem.
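The kernel matrix and its use in a kernel k-means assignment step can be sketched as follows. This is an illustration, not the paper's implementation: it uses an RBF kernel, and the squared feature-space distance to an (implicit) centroid is computed by the standard kernel-trick expansion ||φ(x_i) − m_C||² = k_ii − (2/|C|) Σ_{j∈C} k_ij + (1/|C|²) Σ_{j,l∈C} k_jl:

```python
import math

def rbf(x, y, r=1.0):
    # RBF kernel used for illustration: exp(-r * ||x - y||^2)
    return math.exp(-r * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_matrix(data, kernel=rbf):
    # Precompute K = [k_ij]_{n x n}. Storing this n-by-n table is
    # exactly the O(n^2) space/time bottleneck discussed in the text.
    n = len(data)
    return [[kernel(data[i], data[j]) for j in range(n)] for i in range(n)]

def sq_dist_to_centroid(K, i, cluster):
    # Squared feature-space distance from point i to the implicit
    # centroid of `cluster` (a list of point indices), computed purely
    # from kernel matrix entries via the kernel trick:
    #   ||phi(x_i) - m_C||^2
    #     = k_ii - (2/|C|) sum_j k_ij + (1/|C|^2) sum_{j,l} k_jl
    c = len(cluster)
    cross = sum(K[i][j] for j in cluster) / c
    within = sum(K[j][l] for j in cluster for l in cluster) / (c * c)
    return K[i][i] - 2.0 * cross + within
```

One iteration of kernel k-means then reassigns each point to the cluster that minimizes this quantity, never materializing φ or the centroids themselves.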
Several improvements have been proposed to address these drawbacks. In order to overcome the local minima problem, Likas et al. proposed the global kernel k-means method [6], which produces a final partition that is independent of the initial seed points. The soft geodesic kernel k-means method [7] improves the quality of the clustering result by taking the internal data manifold structure into account. Further, some semi-supervised clustering algorithms aim to improve the clustering accuracy under the supervision of a limited amount of labeled data. Kernel based approaches, such as the kernel based c-means method [8]¹, the kernel-based fuzzy c-means method, the semi-supervised kernel fuzzy c-means method [9], etc., have been successfully used to deal with classification and clustering
¹Some authors refer to the k-means clustering method as the c-means clustering method.
978-1-4244-9477-4/11/$26.00 ©2011 IEEE