A Novel Validity Measure for Clusters of Arbitrary Shapes and Densities
Noha A. Yousri¹,², Mohamed S. Kamel¹, Mohamed A. Ismail²
¹ PAMI lab, Electrical and Computer Engineering, University of Waterloo, Ontario, Canada
² Computer and Systems Engineering, University of Alexandria, Egypt.
{nyousri, mkamel}@pami.uwaterloo.ca, nyousri@alexeng.edu.eg, maismail@pua.edu.eg
Abstract
Several validity indices have been designed to
evaluate solutions obtained by clustering algorithms.
Traditional indices are generally designed to evaluate
center-based clustering, where clusters are assumed to
be of globular shapes with defined centers or
representatives. Therefore, they are not suitable for
evaluating clusters of arbitrary shapes and densities,
where clusters have no defined centers or
representatives but are formed based on the
connectivity of patterns to their neighbours.
In this work, a novel validity measure based on a
density-based criterion is proposed. It is based on the
concept that densities of clusters can be distinguished
by the neighbourhood distances between patterns. It is
suitable for clusters of any shape and of differing
densities. The main concepts of the proposed measure
are explained and experimental results that support
the proposed measure are given.
1. Introduction
Validity indices are used to evaluate a clustering
solution according to specified measures that depend
on measuring the proximity between patterns. The
basic assumption in building a validity index is that
patterns should be more similar to patterns in their
cluster compared to other patterns outside the cluster.
This concept has led to the definition of
homogeneity and separateness measures, where
homogeneity refers to the similarity between patterns
of the same cluster, and separateness refers to the
dissimilarity between patterns of different clusters.
However, homogeneity and separateness measures can
each take different forms depending on the
clustering assumptions: some clustering methods
assume that patterns of the same cluster group
around a centroid or a representative, while others
drop this assumption in favour of the more general one
that there are no specific centroids or representatives
and that patterns connect together to form a cluster.
In the more popular clustering methods, such as
K-Means and average and complete linkage clustering [1],
as well as related algorithms such as BIRCH [2], PAM [3],
and CLARANS [4], the clustering preserves the
common assumption that a cluster has a globular
shape, in which case a cluster representative can
be easily defined. Since validity indices were built for
the most commonly used algorithms, the resulting
indices define homogeneity and separateness in the
presence of centroids; examples are the Dunn, Xie-Beni,
and Davies-Bouldin indices.
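To illustrate how a center-based index ties homogeneity and separateness to centroids, the following is a minimal sketch of the Davies-Bouldin index in its usual form (mean distance to the centroid as scatter, Euclidean distance between centroids as separation). The function names and the input layout, a list of clusters each given as a list of coordinate tuples, are assumptions made for this illustration only.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index; lower values indicate a better partition.

    Assumes at least two clusters whose centroids are distinct.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean distance of its points to its own centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of combined scatter to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

Both terms depend on the centroid, so an elongated or ring-shaped cluster, whose mean may lie far from most of its points, receives a large scatter even when it is well connected.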
However, to move from the traditional assumption of
globular-shaped clusters to the more general
problem of undefined geometrical cluster
shapes, other non-traditional algorithms were
developed, such as DBScan [5], DenClue [6], Shared
Nearest Neighbor [7], Chameleon [8], and Mitosis [9].
Those algorithms are able to find clusters of arbitrary
shapes, for which it is difficult to use cluster
representatives as is done for globular-shaped clusters.
In order to develop a validity measure for the more
general problem of undefined cluster shapes, other
considerations for measuring homogeneity and
separateness should be followed.
In this work, a validity measure that considers the
general problem of arbitrary shape and arbitrary
density clusters is proposed. It is based on minimizing
the standard deviation of the minimum spanning tree
(MST) distances of the cluster, as a homogeneity
measure, and minimizing the number of
neighborhoods that mix patterns from different
clusters, as a separateness measure.
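The two ingredients named above can be sketched as follows. This is only an illustration under stated assumptions, not the measure as defined later in the paper: distances are Euclidean, the population standard deviation is used, a "neighbourhood" is taken to be the k nearest neighbours of a pattern, and all function names and the parameter `k` are hypothetical.

```python
import math
import statistics


def mst_edge_lengths(points):
    """Edge lengths of a Euclidean minimum spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best[i]: shortest known distance from point i to the growing tree.
    best = [math.dist(points[0], p) for p in points]
    lengths = []
    for _ in range(n - 1):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        lengths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                best[v] = min(best[v], math.dist(points[u], points[v]))
    return lengths


def homogeneity(cluster):
    """Standard deviation of the cluster's MST edge lengths (lower is better)."""
    edges = mst_edge_lengths(cluster)
    return statistics.pstdev(edges) if edges else 0.0


def mixed_neighbourhoods(points, labels, k=2):
    """Count patterns whose k nearest neighbours include another cluster."""
    mixed = 0
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        if any(labels[j] != labels[i] for j in others[:k]):
            mixed += 1
    return mixed
```

Neither function refers to a centroid: the MST follows the cluster's connectivity whatever its shape, and the neighbourhood count looks only at local label agreement.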
2. Related Work
Several validity indices have been proposed in the
literature, some of which are external indices such as the
978-1-4244-2175-6/08/$25.00 ©2008 IEEE