Reports of the Faculty of Science and Engineering, Saga University, Vol. 36, No. 1 (2007), pp. 33-38

Automatic shape independent clustering with global optimum cluster determinations

By Kohei Arai* and Ali Ridho Barakbah**

Abstract: A new method is proposed that can identify cluster patterns of any shape in numerical clustering. The method is based on iterative cluster construction, merging clusters according to the nearest neighbor distance between them. It differs from other techniques, in which cluster density is determined by calculating variance factors. The cluster density proposed here is instead determined by a total distance within the cluster, derived from the total distance of the merged cluster and the distance between the merged clusters in the previous stage of cluster construction. The whole density at each stage can thus be determined as the average of the total within-cluster density over all clusters, divided by the maximum furthest distance between clusters at that stage. Besides this, the paper also proposes a technique for finding a global optimum of the cluster construction. Experimental results show how effective the proposed clustering method is for complicated shapes of cluster structure.

Key words: Single linkage hierarchical clustering method; Cluster density; Shape independent clustering; Automatic clustering

1. Introduction

For many years, many clustering algorithms have been proposed and widely used. They can be divided into two categories: hierarchical and non-hierarchical methods. Clustering is commonly used in many fields, such as data mining, pattern recognition, image classification, the biological sciences, marketing, city planning, document retrieval, etc. Clustering is the process of defining a mapping f: D -> C from some data D = {d_1, d_2, ..., d_n} to some clusters C = {c_1, c_2, ..., c_n} based on the similarity between the d_i.
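The mapping f: D -> C can be illustrated with a minimal sketch. Note that this is a hypothetical illustration using a simple nearest-centroid assignment, not the paper's own hierarchical method; the function and variable names are assumptions introduced here.

```python
import math

def assign_clusters(data, centroids):
    """Map each data point d_i to the index of the nearest centroid,
    illustrating a mapping f: D -> C. This is only a sketch: the paper's
    method merges clusters hierarchically instead of using fixed centroids."""
    mapping = {}
    for d in data:
        # math.dist computes the Euclidean distance between two points
        mapping[d] = min(range(len(centroids)),
                         key=lambda k: math.dist(d, centroids[k]))
    return mapping

# Two visually separated groups of 2-D points
data = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_clusters(data, centroids))
```

Points near the origin map to cluster 0 and points near (5, 5) map to cluster 1, matching the intuition that f groups data by similarity.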
The task of finding good clusters is a very critical issue in clustering. Cluster analysis constructs good clusters when the members of a cluster have a high degree of similarity to each other (internal homogeneity) and are not like the members of other clusters (external homogeneity) [2,8]. In fact, most authors find it difficult to describe clustering without some grouping criterion. For example, objects are clustered or grouped on the basis of maximizing the intra-cluster similarity and minimizing the inter-cluster similarity [8]. One method of defining a good cluster is the variance constraint [6], which calculates the cluster density with the variance within a cluster (v_w) and the variance between clusters (v_b) [3,12]. The ideal cluster, in this case, has minimum v_w to express internal homogeneity and maximum v_b to express external homogeneity. The parameters v_w and v_b, however, can only be applied to condensed clustering cases, in which the cluster members gather around their mean values so that the centroid resides at the center of gravity of the members. Therefore, v_w and v_b cannot be used for shape independent clustering, such as convex shape clustering.
One of the most famous clustering methods is hierarchical clustering. In hierarchical clustering, the data are not partitioned into particular clusters at the first step. It starts by making single clusters of similar data, and then continues iteratively. Hierarchical clustering algorithms can be either agglomerative or divisive [4,9,11]. Agglomerative methods proceed by a series of fusions of the n most similar objects into groups, while divisive methods separate the n objects successively into finer groupings. Agglomerative techniques are more commonly used.

Received on Apr. 28, 2007
*Department of Information Science
**EEPIS: Electronic Engineering Polytechnic Institute of Surabaya
©Faculty of Science and Engineering, Saga University
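The variance criterion above can be sketched in code. The definitions below are common textbook formulations of within-cluster and between-cluster variance, and the helper names are assumptions introduced here; the paper may normalize these quantities differently.

```python
import math

def centroid(points):
    # Component-wise mean of a list of 2-D points
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def variance_within(clusters):
    # v_w: mean squared distance of each member to its own cluster centroid
    # (small v_w indicates internal homogeneity)
    total, count = 0.0, 0
    for pts in clusters:
        c = centroid(pts)
        total += sum(math.dist(p, c) ** 2 for p in pts)
        count += len(pts)
    return total / count

def variance_between(clusters):
    # v_b: mean squared distance of the cluster centroids from their
    # grand centroid (large v_b indicates external homogeneity)
    cents = [centroid(pts) for pts in clusters]
    g = centroid(cents)
    return sum(math.dist(c, g) ** 2 for c in cents) / len(cents)

# Two tight, well-separated clusters: v_w should be small, v_b large
clusters = [[(0.0, 0.0), (0.2, 0.0), (0.1, 0.2)],
            [(5.0, 5.0), (5.2, 5.1), (4.9, 5.0)]]
print(variance_within(clusters))
print(variance_between(clusters))
```

For this condensed example v_w is far smaller than v_b, which is exactly the situation the criterion handles well; as the text notes, it breaks down when cluster members do not gather around their centroid.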
One of the similarity factors between objects in hierarchical methods is the single link, in which the similarity is defined by the smallest distance between objects [1]. Therefore, it is called the single linkage clustering method. The Euclidean distance is commonly used to calculate the distance in the case of numerical data sets [11]. For a two-dimensional dataset, it is computed as: