Abstract—Most fuzzy clustering algorithms have shortcomings: e.g., they are not able to detect clusters with non-convex shapes, the number of clusters must be known a priori, or they suffer from numerical problems such as sensitivity to initialization. This paper studies the synergistic combination of the hierarchical, graph-theoretic minimal spanning tree based clustering algorithm with the partitional Gath-Geva fuzzy clustering algorithm. The aim of this hybridization is to increase the robustness and consistency of the clustering results and to reduce the number of heuristically defined parameters of these algorithms, thereby decreasing the user's influence on the clustering results. For the analysis of the resulting fuzzy clusters, a new tool based on a fuzzy similarity measure is presented. The calculated similarities of the clusters can be used for the hierarchical clustering of the resulting fuzzy clusters, which is useful for cluster merging and for the visualization of the clustering results. As the examples used to illustrate the operation of the new algorithm will show, the proposed algorithm can detect clusters of arbitrary shape in the data and does not suffer from the numerical problems of the classical Gath-Geva fuzzy clustering algorithm.
Keywords—Clustering, fuzzy clustering, minimal spanning tree,
cluster validity, fuzzy similarity.
I. INTRODUCTION
Fast and robust clustering algorithms play an important role in extracting useful information from large databases. The aim of cluster analysis is to partition a set of N objects into c clusters such that objects within a cluster are similar to each other and objects in different clusters are dissimilar from each other. Clustering can be used to quantize the available data, to extract a set of cluster prototypes for the compact representation of the dataset, to select the relevant features, to segment the dataset into homogeneous subsets, and to initialize regression and classification models.
There are two main approaches to clustering. Hard clustering algorithms allocate each object to a single cluster, both during their operation and in their output. Fuzzy clustering methods assign degrees of membership in several clusters to each input pattern, and thus yield a more flexible separation of the patterns.
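The difference between the two approaches can be made concrete with a small sketch. The membership formula below is the standard fuzzy c-means update with fuzzifier m; the function names `hard_assignment` and `fuzzy_memberships` are illustrative, not part of the algorithm discussed in this paper:

```python
import numpy as np

def hard_assignment(x, centers):
    """Hard clustering: the pattern belongs to exactly one cluster,
    here the one with the nearest prototype."""
    d = np.linalg.norm(centers - x, axis=1)
    return int(np.argmin(d))

def fuzzy_memberships(x, centers, m=2.0):
    """Fuzzy clustering: degrees of membership in every cluster
    (standard fuzzy c-means formula; assumes x coincides with no center).
    The degrees are nonnegative and sum to 1 over the clusters."""
    d = np.linalg.norm(centers - x, axis=1)
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
x = np.array([1.0, 0.0])
print(hard_assignment(x, centers))       # → 0 (all-or-nothing)
u = fuzzy_memberships(x, centers)
print(u)  # graded: higher degree for the nearer cluster, degrees sum to 1
```

The hard method reports only "cluster 0", while the fuzzy method also conveys how strongly the pattern is attached to each cluster.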
Manuscript received October 18, 2005.
Ágnes Vathy-Fogarassy, University of Veszprém, Department of
Mathematics and Computing Science, P.O. Box 158, Veszprém, H-8201
Hungary (e-mail: vathya@almos.vein.hu).
Balázs Feil, University of Veszprém, Department of Process Engineering,
P.O. Box 158, Veszprém, H-8201 Hungary (e-mail: feilb@fmt.vein.hu).
János Abonyi, University of Veszprém, Department of Process
Engineering, P.O. Box 158, Veszprém, H-8201 Hungary (e-mail:
abonyij@fmt.vein.hu).
In the literature a wide variety of algorithms (partitional, hierarchical, density-based, graph-based, model-based, etc.) have been proposed, but it remains a difficult challenge to find a general and powerful method that is robust and does not require fine-tuning by the user. Most of these algorithms have shortcomings.
For example, the basic partitional methods are not able to detect non-convex clusters, hierarchical methods are not efficient enough for large datasets, and linkage-based methods often suffer from the chaining effect. A further problem accompanying the use of a partitional algorithm is that the number of desired clusters must be given in advance.
The partitional techniques usually produce clusters by optimizing a criterion function defined either locally (on a subset of the patterns) or globally (over all of the patterns). Generally, different cluster shapes (orientations, volumes) are required for the different clusters (partitions), but there is no guideline as to how to choose them a priori.
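As a minimal illustration of a globally defined criterion, the sketch below alternates assignment and prototype-update steps to decrease the total squared error. This is the classical hard c-means (k-means) scheme, shown here only to make the notion of a criterion function concrete; it is not the method proposed in this paper:

```python
import numpy as np

def kmeans(X, centers, n_iter=10):
    """Minimal hard partitional clustering: alternately assign each pattern
    to its nearest prototype and move each prototype to the mean of its
    patterns.  Each sweep decreases the global criterion
    J = sum_i sum_{x in cluster i} ||x - v_i||^2
    (assumes no cluster becomes empty during the iterations)."""
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == i].mean(axis=0)
                            for i in range(len(centers))])
    return centers, labels

# two well-separated groups; J is minimized when each prototype
# sits at a group mean
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])
centers, labels = kmeans(X, X[[0, 3]].astype(float))
```

Note that the number of prototypes (here two) must be supplied in advance, which is exactly the drawback of partitional methods mentioned above.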
The norm-inducing matrix of the cluster prototypes can be adapted by using estimates of the data covariance, and can be used to estimate the statistical dependence of the data in each cluster. The Gaussian mixture based fuzzy maximum likelihood estimation algorithm (the Gath-Geva (GG) algorithm) is built on such an adaptive distance measure: it can adapt the distance norm to the underlying distribution of the data, which is reflected in the different sizes of the clusters, and hence it is able to detect clusters with different orientations and volumes. Unfortunately, the GG algorithm is very sensitive to initialization, so it often cannot be applied directly to the data.
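The adaptive distance at the heart of the GG algorithm can be sketched as follows. The squared distance of a pattern to a cluster depends on the cluster's fuzzy covariance matrix F and prior probability alpha; this is a sketch of the standard fuzzy maximum likelihood distance, with an illustrative function name, not the authors' code:

```python
import numpy as np

def gg_distance2(x, v, F, alpha):
    """Squared fuzzy maximum likelihood (Gath-Geva) distance of pattern x
    to a cluster with center v, fuzzy covariance matrix F and prior
    probability alpha.  Unlike a fixed Euclidean norm, this distance
    adapts to the cluster's orientation, volume and density."""
    diff = x - v
    mahalanobis = diff @ np.linalg.solve(F, diff)   # (x - v)^T F^{-1} (x - v)
    return np.sqrt(np.linalg.det(F)) / alpha * np.exp(0.5 * mahalanobis)

# a pattern lying along a cluster's long axis is "nearer" to an
# elongated cluster than to a spherical one with the same center
x = np.array([2.0, 0.0])
v = np.array([0.0, 0.0])
F_elongated = np.diag([4.0, 0.25])   # stretched along the first axis
F_spherical = np.eye(2)
```

Because the exponential in the distance amplifies small differences in the prototypes, a poor initialization can trap the algorithm far from a good partition, which is the sensitivity mentioned above.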
Hierarchical clustering approaches are related to graph-theoretic clustering. These algorithms are able to detect clusters of various shapes and sizes, and they do not require initialization. One of the best-known graph-based divisive clustering algorithms is based on the construction of the minimal spanning tree (MST) of the objects [3,7,9,13,16]. By eliminating edges from the MST we obtain subtrees which correspond to clusters. Clustering methods using a minimal spanning tree take advantage of its structure: the MST ignores many possible connections between the data patterns, so the cost of clustering can be decreased. Single-link clusters are subgraphs of the minimum spanning tree of the data [10,11], which are also its connected components. Complete-link clusters are maximal complete subgraphs and are related to the node colorability of graphs [2]. The maximal complete subgraph was considered the strictest definition of a cluster in [1,15]. Clustering, as an unsupervised learning task, is mainly
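The divisive MST idea described above (build the tree, cut the longest edges, and treat the remaining subtrees as clusters) can be sketched as follows; `mst_edges` and `mst_clusters` are illustrative names for a minimal Prim-plus-cut implementation, not the algorithm proposed in this paper:

```python
import numpy as np

def mst_edges(X):
    """Prim's algorithm: return the (i, j, length) edges of the minimal
    spanning tree of the patterns in X."""
    n = len(X)
    # best[j] = (distance from j to the growing tree, nearest tree node)
    best = {j: (np.linalg.norm(X[j] - X[0]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])   # closest outside node
        d, i = best.pop(j)
        edges.append((i, j, d))
        for k in best:                            # relax distances via j
            dk = np.linalg.norm(X[k] - X[j])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges

def mst_clusters(X, c):
    """Cut the c - 1 longest MST edges; the remaining subtrees are the
    clusters.  Labels are component representatives from a union-find pass."""
    kept = sorted(mst_edges(X), key=lambda e: e[2])[:len(X) - c]
    parent = list(range(len(X)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    return [find(k) for k in range(len(X))]
```

Cutting the c - 1 longest edges reproduces a single-link partition into c clusters, which is the sense in which single-link clusters are subgraphs of the MST.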
Minimal Spanning Tree based Fuzzy Clustering
Ágnes Vathy-Fogarassy, Balázs Feil, and János Abonyi
PROCEEDINGS OF WORLD ACADEMY OF SCIENCE, ENGINEERING AND TECHNOLOGY VOLUME 8 OCTOBER 2005 ISSN 1307-6884