Determining the Optimal Number of Clusters Using a New Evolutionary Algorithm

Wei Lu and Issa Traore
Department of Electrical and Computer Engineering
University of Victoria
{wlu, itraore}@ece.uvic.ca

Abstract

Estimating the optimal number of clusters for a dataset is one of the most essential issues in cluster analysis. An improper pre-selection of the number of clusters can easily lead to a poor clustering outcome. In this paper, we propose a new evolutionary algorithm to address this issue. Specifically, the proposed algorithm defines a new entropy-based fitness function and three new genetic operators for splitting, merging, and removing clusters. Empirical evaluations using a synthetic dataset and an existing benchmark show that the proposed evolutionary algorithm can accurately estimate the optimal number of clusters for a set of data.

1. Introduction

Identifying the optimal number of clusters for a set of data is essential for effective and efficient data clustering. For instance, a clustering algorithm such as k-means may generate a bad clustering result if the initial partitions are not properly chosen, and choosing them properly is often far from obvious. Another popular clustering approach sensitive to this problem is based on the Gaussian mixture model (GMM). The GMM assumes that the data to be clustered are drawn from one of several Gaussian distributions, and it has been suggested that a Gaussian mixture distribution can approximate any distribution to arbitrary accuracy, as long as a sufficient number of components is used [1]. A common approach for estimating the parameters of a GMM is the Expectation-Maximization (EM) algorithm [2]. Previous attempts to estimate the number of mixture components of a GMM are mainly based on statistical techniques; examples of such work can be found in [3] and [4].
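As an illustration of the GMM/EM setup referenced above, the following sketch fits a one-dimensional Gaussian mixture with parameters {α_i, μ_i, σ_i} by EM. This is an illustrative sketch only, not the cited implementation [2]: the function name, the quantile-based initialization, and the fixed iteration count are our own assumptions.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """Fit a 1-D Gaussian mixture with k components via EM.

    Illustrative sketch (not the authors' implementation). Returns the
    mixing weights alpha, means mu, standard deviations sigma, and the
    posterior probabilities p(i|x_n) as an (n, k) array.
    """
    n = len(x)
    # Initialization (our choice): equal weights, quantile-spread means,
    # and the global standard deviation for every component.
    alpha = np.full(k, 1.0 / k)
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    sigma = np.full(k, np.std(x))
    for _ in range(n_iter):
        # E-step: posterior p(i|x_n) proportional to alpha_i * N(x_n; mu_i, sigma_i).
        dens = (alpha / (sigma * np.sqrt(2 * np.pi))
                * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2))
        post = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate each component from its responsibilities.
        nk = post.sum(axis=0)
        alpha = nk / n
        mu = (post * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((post * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return alpha, mu, sigma, post
```

For well-separated data, the recovered means approach the true component means, and each row of the posterior matrix sums to one, which is what makes the hard assignment of each x_n to its most likely component well defined.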
Although the previous approaches based on statistical techniques have proved their ability to estimate the optimal number of clusters, they are prone to converging to local optima, since they usually stop searching once the corresponding criteria reach certain thresholds. In contrast, evolutionary computation schemes have an inherent capability to escape local optima, since their search space for optimal solutions can be extended through genetic operations and optimizing selection. Based on this observation, in this paper we tackle the problem of estimating the number of clusters using a novel evolutionary approach that combines the Gaussian mixture model and the EM algorithm.

The rest of the paper is structured as follows. Section 2 describes the evolutionary algorithm for determining the optimal number of clusters for a set of data. Section 3 presents and discusses the results obtained in the empirical validation of the algorithm.

2. Evolutionary algorithm

2.1. Evolutionary entities

2.1.1. Representation of evolutionary individuals. The EM algorithm generates an estimate for the set of parameters {α_i, μ_i, σ_i} and the posterior probabilities p(i|x_n). The posterior probability describes the likelihood that the data pattern x_n belongs to a specified Gaussian component i. Each data point x_n is assigned to the corresponding Gaussian component i according to p(i|x_n), and the final clustering results are statistically represented by the set of parameters {α_i, μ_i, σ_i}, which also constitute the evolutionary individuals.

2.1.2. Evolutionary operators. During the evolution, we sometimes need to split components, as well as to merge or delete them. We therefore propose three new evolutionary operators, called the splitting, merging, and deletion operators, which are used, as their names indicate, for splitting, merging, and removing components. The detailed definitions of the operators can be found in [5].
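Since the detailed operator definitions are deferred to [5], the sketch below shows only one plausible realization, for a one-dimensional mixture {α_i, μ_i, σ_i}, of the posterior-based assignment and of the splitting, merging, and deletion operators. All function names and the specific update rules (halving a weight on a split, weight-averaging on a merge, renormalizing the weights after a deletion) are illustrative assumptions, not the definitions from [5].

```python
import numpy as np

def assign_clusters(post):
    """Hard-assign each pattern x_n to the component i maximizing p(i|x_n)."""
    return np.argmax(post, axis=1)

def split_component(alpha, mu, sigma, i, eps=0.5):
    """Split component i into two, offset by eps*sigma_i on either side
    of mu_i, each inheriting half of the original mixing weight."""
    keep = np.arange(len(alpha)) != i
    return (np.append(alpha[keep], [alpha[i] / 2, alpha[i] / 2]),
            np.append(mu[keep], [mu[i] - eps * sigma[i], mu[i] + eps * sigma[i]]),
            np.append(sigma[keep], [sigma[i], sigma[i]]))

def merge_components(alpha, mu, sigma, i, j):
    """Merge components i and j into one weight-averaged component."""
    a = alpha[i] + alpha[j]
    m = (alpha[i] * mu[i] + alpha[j] * mu[j]) / a
    s = (alpha[i] * sigma[i] + alpha[j] * sigma[j]) / a
    keep = [t for t in range(len(alpha)) if t not in (i, j)]
    return (np.append(alpha[keep], a),
            np.append(mu[keep], m),
            np.append(sigma[keep], s))

def delete_component(alpha, mu, sigma, i):
    """Remove component i and renormalize the remaining mixing weights."""
    alpha = np.delete(alpha, i)
    return alpha / alpha.sum(), np.delete(mu, i), np.delete(sigma, i)
```

Each operator returns a valid mixture (weights summing to one), so the EM re-estimation step can be re-run on the modified individual before its fitness is evaluated.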
Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’05) 1082-3409/05 $20.00 © 2005 IEEE