A Novel Validity Measure for Clusters of Arbitrary Shapes and Densities

Noha A. Yousri (1,2), Mohamed S. Kamel (1), Mohamed A. Ismail (2)
(1) PAMI lab, Electrical and Computer Engineering, University of Waterloo, Ontario, Canada
(2) Computer and Systems Engineering, University of Alexandria, Egypt
{nyousri, mkamel}@pami.uwaterloo.ca, nyousri@alexeng.edu.eg, maismail@pua.edu.eg

Abstract

Several validity indices have been designed to evaluate solutions obtained by clustering algorithms. Traditional indices are generally designed to evaluate center-based clustering, where clusters are assumed to be of globular shapes with defined centers or representatives. Therefore they are not suitable for evaluating clusters of arbitrary shapes and densities, where clusters have no defined centers or representatives, but are formed based on the connectivity of patterns to their neighbours. In this work, a novel validity measure based on a density-based criterion is proposed. It is based on the concept that densities of clusters can be distinguished by the neighbourhood distances between patterns. It is suitable for clusters of any shape and of different densities. The main concepts of the proposed measure are explained, and experimental results that support the proposed measure are given.

1. Introduction

Validity indices are used to evaluate a clustering solution according to specified measures that depend on the proximity between patterns. The basic assumption in building a validity index is that patterns should be more similar to patterns in their own cluster than to patterns outside it. This concept has led to the foundation of homogeneity and separateness measures, where homogeneity refers to the similarity between patterns of the same cluster, and separateness refers to the dissimilarity between patterns of different clusters.
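As a minimal sketch of these two notions, and not the measure proposed in this work, the following Python code computes one simple form of each: the mean pairwise distance between patterns of the same cluster, and the mean pairwise distance between patterns of different clusters. The function names, the use of Euclidean distance, and the averaging over all pairs are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def homogeneity(points, labels):
    """Mean distance between distinct patterns that share a cluster.

    Lower values mean more homogeneous clusters. Assumes at least one
    cluster contains two or more patterns.
    """
    dists = pairwise_distances(points, points)
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)  # skip each pattern's zero distance to itself
    return dists[same].mean()

def separateness(points, labels):
    """Mean distance between patterns that belong to different clusters.

    Higher values mean better separated clusters. Assumes at least two
    clusters are present.
    """
    dists = pairwise_distances(points, points)
    different = labels[:, None] != labels[None, :]
    return dists[different].mean()

# Toy data (hypothetical): two tight pairs of patterns, far apart.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])  # groups nearby patterns together
bad_labels = np.array([0, 1, 0, 1])   # groups distant patterns together

good_scores = (homogeneity(points, good_labels), separateness(points, good_labels))
bad_scores = (homogeneity(points, bad_labels), separateness(points, bad_labels))
```

Under these definitions, the labeling that groups nearby patterns yields a lower homogeneity value and a higher separateness value than the labeling that groups distant patterns.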
However, the homogeneity and separateness measures can each take different forms depending on the clustering assumptions. Some clustering methods assume that patterns of the same cluster group around a centroid or a representative, while others drop this assumption in favor of the more general one that there are no specific centroids or representatives, and that patterns connect together to form a cluster.

In the more popular clustering methods, such as K-Means and average and complete linkage clustering [1], as well as related algorithms such as BIRCH [2], PAM [3], and CLARANS [4], the clustering preserves the common assumption of a globular cluster shape, in which case a cluster representative can be easily defined. Since validity indices were built for the most commonly used algorithms, the resulting indices defined homogeneity and separateness in the presence of centroids; examples are the Dunn, Xie-Beni, and Davies-Bouldin indices.

However, moving from the traditional assumption of globular clusters to the more general problem of undefined geometrical cluster shapes, other non-traditional algorithms were developed, such as DBScan [5], DenClue [6], Shared Nearest Neighbor [7], Chameleon [8], and Mitosis [9]. These algorithms are able to find clusters of arbitrary shapes, for which it is difficult to use cluster representatives as is done for globular clusters. Developing a validity measure for the more general problem of undefined cluster shapes therefore requires other considerations for measuring homogeneity and separateness.

In this work, a validity measure that considers the general problem of arbitrary shape and arbitrary density clusters is proposed. It is based on minimizing the standard deviation of the minimum spanning tree (MST) distances of the cluster, as a homogeneity measure, and minimizing the number of neighborhoods that mix patterns from different clusters, as a separateness measure.

2. Related Work

Several validity indices have been proposed in the literature, some of which are external indices as the

978-1-4244-2175-6/08/$25.00 ©2008 IEEE