K. Deb et al. (Eds.): GECCO 2004, LNCS 3103, pp. 162–173, 2004. © Springer-Verlag Berlin Heidelberg 2004

Clustering with Niching Genetic K-means Algorithm

Weiguo Sheng, Allan Tucker, and Xiaohui Liu
Department of Information Systems and Computing
Brunel University, Uxbridge, Middlesex, UB8 3PH, UK
{weiguo.sheng, allan.tucker, xiaohui.liu}@brunel.ac.uk

Abstract. GA-based clustering algorithms often employ a simple GA, a steady-state GA, or their variants, and fail to consistently and efficiently identify high-quality solutions (best known optima) for clustering problems that involve large data sets with many local optima. To circumvent this problem, we propose the Niching Genetic K-means Algorithm (NGKA), which is based on modified deterministic crowding and embeds the computationally attractive k-means. Our experiments show that NGKA can consistently and efficiently identify high-quality solutions. The experiments use both simulated and real data of varying size and with varying numbers of local optima. The significance of NGKA is also demonstrated on the experimental data sets by comparing it, through simulations, with the Genetically Guided Algorithm (GGA) and the Genetic K-means Algorithm (GKA).

1 Introduction

Clustering is useful in exploratory data analysis. Cluster analysis organizes data by grouping individuals in a population in order to discover structure, or clusters, in the data. In some sense, we would like the individuals within a group to be similar to one another, but dissimilar from individuals in other groups. Various types of clustering algorithms have been proposed to suit different requirements. For clustering large data sets, there is a general consensus that partitional algorithms are imperative. Partitional clustering algorithms generate a single partitioning of the data, with a specified or estimated number of clusters, in an attempt to recover the natural groups present in the data.
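As a concrete illustration of a partitional algorithm, the following is a minimal k-means sketch in Python. It is not the paper's implementation; the data, names, and convergence test are illustrative only. Note that the result depends on the randomly chosen initial centers, which is precisely the sensitivity discussed next.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: partition `points` (tuples) into k clusters.

    The outcome depends on the random initial centers, so the
    procedure may converge to a local optimum.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged (possibly locally optimal)
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of 2-D points (illustrative data).
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
        (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centers, clusters = kmeans(data, k=2)
```

On this easy, well-separated data the procedure recovers the two natural groups; on large data sets with many local optima, different initializations can yield very different partitions.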
Among the partitional clustering algorithms, k-means [5] has been widely used because of its simplicity and efficiency. However, it depends strongly on the initial choice of cluster centers and may end up in a local optimum.

A possible way to deal with the local optimality of clustering problems is to use stochastic optimization schemes, such as Genetic Algorithms (GAs), which are believed to be less sensitive to initial conditions. There have been many attempts to use GAs for clustering problems. Roughly, these attempts can be classified into GA approaches, such as [12,4,7], and hybrid GA approaches, such as [15,9]. In most cases, the above approaches are reported to perform well on small data sets with few local optima. However, real clustering problems may involve large data sets with many local optima. On such clustering problems, both GA and hybrid GA approaches can run into difficulties. First, they have trouble consistently identifying high-quality solutions, mainly because they employ either the Simple GA (SGA) [6], the Steady State GA (SSGA)