K. Deb et al. (Eds.): GECCO 2004, LNCS 3103, pp. 162–173, 2004.
© Springer-Verlag Berlin Heidelberg 2004
Clustering with Niching Genetic K-means Algorithm
Weiguo Sheng, Allan Tucker, and Xiaohui Liu
Department of Information Systems and Computing
Brunel University, Uxbridge, Middlesex UB8 3PH, UK
{weiguo.sheng, allan.tucker, xiaohui.liu}@brunel.ac.uk
Abstract. GA-based clustering algorithms often employ either the simple GA, the steady-state GA, or their variants, and fail to consistently and efficiently identify high-quality solutions (best known optima) of clustering problems that involve large data sets with many local optima. To circumvent this problem, we propose the Niching Genetic K-means Algorithm (NGKA), which is based on modified deterministic crowding and embeds the computationally attractive k-means. Our experiments, conducted on both simulated and real data sets of varying size and with varying numbers of local optima, show that NGKA can consistently and efficiently identify high-quality solutions. The significance of NGKA is also demonstrated on the experimental data sets through a simulation-based comparison with the Genetically Guided Algorithm (GGA) and the Genetic K-means Algorithm (GKA).
1 Introduction
Clustering is useful in exploratory data analysis. Cluster analysis organizes data by grouping the individuals in a population in order to discover structure, or clusters, in the data. Informally, we would like the individuals within a group to be similar to one another, but dissimilar from individuals in other groups. Various types of clustering algorithms have been proposed to suit different requirements. For clustering large data sets, there is a general consensus that partitional algorithms are imperative. Partitional clustering algorithms generate a single partitioning of the data, with a specified or estimated number of clusters, in an attempt to recover the natural groups present in the data. Among partitional clustering algorithms, k-means [5] has been popular because of its simplicity and efficiency. However, it depends heavily on the initial choice of cluster centers and may converge to a local optimum.
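The sensitivity of k-means to its initial centers can be seen even on a toy example. The sketch below is a minimal pure-Python implementation of Lloyd's k-means; the four-point data set and both initializations are illustrative choices, not from the paper. The two runs converge to different partitions, one with nearly twice the squared error of the other:

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iters=100):
    """Lloyd's k-means: alternate assignment and center update until stable."""
    for _ in range(max_iters):
        # assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # move each center to the mean of its cluster (keep empty clusters fixed)
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments no longer change
            break
        centers = new_centers
    return centers

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(dist2(p, c) for c in centers) for p in points)

# Four points at the corners of a 3-by-4 rectangle; k = 2.
points = [(0, 0), (0, 4), (3, 0), (3, 4)]

good = kmeans(points, [(0.0, 0.0), (0.0, 4.0)])  # converges to SSE 9.0
bad = kmeans(points, [(0.0, 0.0), (3.0, 0.0)])   # stuck at SSE 16.0
print(sse(points, good), sse(points, bad))
```

Starting from the two corners on one short side, the algorithm immediately settles into the inferior vertical split and cannot escape it; this kind of entrapment is exactly what motivates the stochastic search schemes discussed next.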
A possible way to deal with the local optimality of clustering problems is to use stochastic optimization schemes such as Genetic Algorithms (GAs), which are believed to be less sensitive to initial conditions. There have been many attempts to use GAs for clustering problems. Roughly, these attempts can be classified into GA approaches, such as [12,4,7], and hybrid GA approaches, such as [15,9]. In most cases, the above approaches are reported to perform well on small data sets with few local optima. However, real clustering problems may involve large data sets with many local optima. On such clustering problems, both GA and hybrid GA approaches can run into difficulties. First, they have trouble consistently identifying high-quality solutions, mainly because they employ either the Simple GA (SGA) [6], the Steady State GA (SSGA)