Determining the Optimal Number of Clusters
Using a New Evolutionary Algorithm
Wei Lu and Issa Traore
Department of Electrical and Computer Engineering
University of Victoria
{wlu, itraore}@ece.uvic.ca
Abstract
Estimating the optimal number of clusters for a
dataset is one of the most essential issues in cluster
analysis. An improper pre-selection of the number of
clusters can easily lead to a poor clustering outcome. In
this paper, we propose a new evolutionary algorithm
to address this issue. Specifically, the proposed
evolutionary algorithm defines a new entropy-based
fitness function and three new genetic operators for
splitting, merging, and removing clusters. Empirical
evaluations on a synthetic dataset and an existing
benchmark show that the proposed evolutionary
algorithm can accurately estimate the optimal number of
clusters for a set of data.
1. Introduction
Identifying the optimal number of clusters for a set
of data is essential for effective and efficient data
clustering. For instance, a clustering algorithm such as
k-means may produce a poor clustering result if the
initial partitions are not properly chosen, and choosing
them well is not an obvious task. Another popular
clustering approach sensitive to this problem is based
on the Gaussian mixture model (GMM). GMM assumes
that the data to be clustered are drawn from one of
several Gaussian distributions, and it has been
suggested that a Gaussian mixture distribution can
approximate any distribution to arbitrary accuracy,
provided a sufficient number of components is used [1].
A common approach for estimating the parameters of a
GMM is the Expectation-Maximization (EM)
algorithm [2].
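To make the EM procedure for a GMM concrete, the following is a minimal one-dimensional sketch (not the authors' implementation); the quantile-based initialization and the small variance floor are illustrative choices, not part of the paper.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """EM for a one-dimensional Gaussian mixture with k components.
    Returns mixing weights alpha, means mu, standard deviations sigma,
    and the posterior responsibilities p(i | x_n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    alpha = np.full(k, 1.0 / k)                      # uniform initial weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)    # spread initial means over the data
    sigma = np.full(k, x.std() + 1e-6)
    for _ in range(n_iter):
        # E-step: posterior p(i | x_n) for every point/component pair
        dens = (np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2)
                / (sigma * np.sqrt(2 * np.pi)))
        post = alpha * dens
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the posteriors
        nk = post.sum(axis=0)
        alpha = nk / n
        mu = (post * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((post * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return alpha, mu, sigma, post
```

On two well-separated Gaussian clusters, the recovered means land near the true cluster centers; note that k itself is fixed here, which is exactly the quantity the paper's evolutionary algorithm sets out to determine.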
Previous attempts to estimate the number of mixture
components for the GMM are mainly based on
statistical techniques; examples of these previous
works are given in [3] and [4]. Although these
statistical approaches have demonstrated their ability
to estimate the optimal number of clusters, they are
prone to converging to local optima, since they usually
stop searching once the corresponding criteria reach
certain thresholds. In contrast, evolutionary
computation schemes have an inherent ability to
escape local optima, since their search space for
optimal solutions can be extended through genetic
operations and selection. Based on this observation, in
this paper we tackle the problem of estimating the
number of clusters using a novel evolutionary
approach that combines the Gaussian mixture model
and the EM algorithm.
The rest of the paper is structured as follows.
Section 2 describes the evolutionary algorithm for
determining the optimal number of clusters for a set of
data. Section 3 presents and discusses the results
obtained in the empirical validation of the algorithm.
2. Evolutionary algorithm
2.1. Evolutionary entities
2.1.1. Representation of evolutionary individuals.
The EM algorithm generates an estimate of the
parameter set {α_i, μ_i, σ_i} and the posterior
probabilities p(i|x_n). The posterior probability
describes the likelihood that the data pattern x_n
belongs to a specific Gaussian component i. Each data
point x_n is assigned to the corresponding Gaussian
component i according to p(i|x_n), and the final
clustering results are statistically represented by the
parameter set {α_i, μ_i, σ_i}, which also constitutes
the evolutionary individuals.
2.1.2. Evolutionary operators. During the evolution,
we sometimes need to split components, as well as to
merge or delete them. We therefore propose three new
evolutionary operators, called the splitting, merging,
and deletion operators, which, as their names indicate,
split, merge, and remove components. The detailed
definitions of the operators can be found in [5].
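Since the paper defers the exact operator definitions to [5], the following is a hypothetical sketch of how such operators could act on an individual {α_i, μ_i, σ_i}; the specific update rules (halving the weight on a split, weight-averaging on a merge, renormalizing after a deletion) are illustrative assumptions, not the authors' definitions.

```python
import numpy as np

def split(alpha, mu, sigma, i):
    """Split component i into two children, each with half the weight and
    means perturbed by one standard deviation (illustrative choice)."""
    a = np.concatenate([np.delete(alpha, i), [alpha[i] / 2, alpha[i] / 2]])
    m = np.concatenate([np.delete(mu, i), [mu[i] - sigma[i], mu[i] + sigma[i]]])
    s = np.concatenate([np.delete(sigma, i), [sigma[i], sigma[i]]])
    return a, m, s

def merge(alpha, mu, sigma, i, j):
    """Merge components i and j into their weight-averaged combination."""
    a_new = alpha[i] + alpha[j]
    m_new = (alpha[i] * mu[i] + alpha[j] * mu[j]) / a_new
    s_new = (alpha[i] * sigma[i] + alpha[j] * sigma[j]) / a_new
    keep = [k for k in range(len(alpha)) if k not in (i, j)]
    return (np.append(alpha[keep], a_new),
            np.append(mu[keep], m_new),
            np.append(sigma[keep], s_new))

def delete(alpha, mu, sigma, i):
    """Remove component i and renormalize the remaining weights."""
    a = np.delete(alpha, i)
    return a / a.sum(), np.delete(mu, i), np.delete(sigma, i)
```

All three operators preserve the constraint that the mixing weights sum to one, which keeps each individual a valid mixture model throughout the evolution.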
Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’05)
1082-3409/05 $20.00 © 2005 IEEE