Reports of the Faculty of Science and Engineering, Saga University, Vol. 36, No. 1 (2007), pp. 33-38
Automatic shape independent clustering with global optimum cluster
determinations
By
Kohei Arai* and Ali Ridho Barakbah**
Abstract: A new method is proposed that can identify cluster patterns of any shape in numerical
clustering. The method is based on iterative cluster construction that merges clusters using the
nearest-neighbor distance between them. The method differs from other techniques, in which the
cluster density is determined by calculating variance factors. The cluster density proposed
here is, on the other hand, determined from the total distance within a cluster, derived from the total
distance of the merged cluster and the distance between the merged clusters at the previous stage of cluster
construction. The whole density at each stage can then be determined as the average of the total
within-cluster densities of the clusters, divided by the maximum furthest distance between
clusters at that stage. Besides this, the paper also proposes a technique for finding a global optimum of
the cluster construction. Experimental results show how effective the proposed clustering method is for
complicated shapes of cluster structure.
Key words: Single linkage hierarchical clustering method; Cluster density; Shape independent
clustering; Automatic clustering
1. Introduction
For many years, many clustering algorithms have
been proposed and widely used. They can be divided into
two categories: hierarchical and non-hierarchical
methods. Clustering is commonly used in many fields, such as
data mining, pattern recognition, image classification,
the biological sciences, marketing, city planning, document
retrieval, etc. Clustering means a process that defines a
mapping f: D → C from some data D = {d1, d2, …, dn} to
some clusters C = {c1, c2, …, cn} based on the similarity
between the di.
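The mapping f: D → C can be made concrete as an assignment of a cluster label to each datum. The following minimal sketch uses invented data points and two fixed, hypothetical cluster centers purely for illustration:

```python
# A clustering is a mapping f: D -> C assigning each datum d_i a cluster label.
# Data points (hypothetical 2-D numerical data).
D = [(1.0, 1.1), (0.9, 1.0), (5.0, 5.2), (5.1, 4.9)]

def f(d):
    """Toy mapping: assign each point to the nearer of two fixed centers."""
    c1, c2 = (1.0, 1.0), (5.0, 5.0)           # illustrative cluster centers
    dist = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return "c1" if dist(d, c1) <= dist(d, c2) else "c2"

clusters = {d: f(d) for d in D}               # the mapping f over all of D
print(clusters)
```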
The task of finding good clusters is a critical
issue in clustering. Cluster analysis constructs good
clusters when the members of a cluster have a high
degree of similarity to each other (internal homogeneity)
and are unlike the members of other clusters (external
heterogeneity) [2,8].
In fact, most authors find it difficult to describe
clustering without some grouping criterion. For example,
the objects are clustered or grouped on the basis of
maximizing the intra-cluster similarity and minimizing
the inter-cluster similarity [8].

Received on Apr. 28, 2007
*Department of Information Science
**EEPIS: Electronic Engineering Polytechnic Institute of Surabaya
©Faculty of Science and Engineering, Saga University

One of the methods to define a good cluster is the
variance constraint [6], which calculates the cluster
density with the variance within clusters (v_w) and the
variance between clusters (v_b) [3,12]. The ideal
cluster, in this case, has minimum v_w, expressing internal
homogeneity, and maximum v_b, expressing external
heterogeneity.
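The two variance factors can be sketched as follows. This is a minimal illustration of within-cluster and between-cluster variance for numerical data; the exact weighting conventions used in [3,12] may differ:

```python
# Sketch of within-cluster variance (v_w) and between-cluster variance (v_b).

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def variances(clusters):
    """clusters: list of lists of points. Returns (v_w, v_b)."""
    all_points = [p for c in clusters for p in c]
    g = mean(all_points)                      # global centroid
    n = len(all_points)
    centroids = [mean(c) for c in clusters]
    # v_w: average squared distance of members to their own cluster centroid
    v_w = sum(sq_dist(p, m)
              for c, m in zip(clusters, centroids) for p in c) / n
    # v_b: size-weighted average squared distance of centroids to the global one
    v_b = sum(len(c) * sq_dist(m, g) for c, m in zip(clusters, centroids)) / n
    return v_w, v_b
```

For two tight, well-separated groups, v_w comes out small and v_b large, matching the ideal-cluster criterion above.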
The parameters v_w and v_b, however, can only be
applied to condensed clustering cases, in which
the cluster members gather around surrounding values
so that the centroid resides at the center of gravity of the
members. Therefore, v_w and v_b cannot be used for shape-
independent clustering, such as clustering of non-convex cluster shapes.
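This limitation can be illustrated with two concentric ring-shaped clusters, a constructed example with arbitrarily chosen radii: both centroids fall at the origin, so the between-cluster variance collapses to zero even though the rings are clearly separated.

```python
import math

def ring(radius, n=12):
    """Points evenly spaced on a circle of the given radius around the origin."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

inner, outer = ring(1.0), ring(5.0)
ci, co = centroid(inner), centroid(outer)
# Both centroids sit at (almost exactly) the origin, so v_b ~ 0 and
# cannot express the real separation between the two ring clusters.
print(ci, co)
```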
One of the most famous clustering methods is
hierarchical clustering. In hierarchical clustering, the data
are not partitioned into particular clusters at the first
step. It starts by forming a single cluster of the most
similar objects, and then continues iteratively.
Hierarchical clustering algorithms can be either
agglomerative or divisive [4,9,11]. Agglomerative
methods proceed by a series of fusions of the n similar
objects into groups, while divisive methods separate the
n objects successively into finer groupings.
Agglomerative techniques are more commonly used.
One of the similarity factors between objects in
hierarchical methods is the single link, in which similarity
is closely related to the smallest distance between objects
[1]. It is therefore called the single linkage clustering
method. The Euclidean distance is commonly used to
calculate the distance for numerical data sets [11].
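A minimal agglomerative single-linkage sketch using the Euclidean distance is given below. This is a naive O(n^3) illustration of the general technique, not the authors' full method; the example points and target cluster count are invented:

```python
# Naive agglomerative single-linkage clustering with Euclidean distance.
# Repeatedly merges the two clusters whose closest members are nearest,
# until only k clusters remain.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(ca, cb):
    """Single-link distance: smallest pairwise distance between two clusters."""
    return min(euclid(p, q) for p in ca for q in cb)

def agglomerate(points, k):
    clusters = [[p] for p in points]          # start: each point is a cluster
    while len(clusters) > k:
        # find the closest pair of clusters under the single-link criterion
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)        # merge cluster j into cluster i
    return clusters

# Example: two elongated groups that single linkage follows chain-wise.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
print(agglomerate(pts, 2))                    # prints the two merged groups
```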
For a two-dimensional data set, it is computed as: