A Clustering Algorithm based on Local Accumulative Knowledge

Yu Zong 1,2, Ping Jin 1*
1 West Anhui University, Luan, China
2 University of Science and Technology of China, Hefei, China
Email: Nick.zongy@gmail.com, Jinping@wxc.edu.cn

Guandong Xu
Victoria University, Melbourne, Australia
Email: Guandongxu@vu.edu.cn

Rong Pan
Aalborg University, DK-9220, Denmark

Abstract—Clustering, an important unsupervised learning technique, is widely used to discover the inherent structure of a given data set. Because clustering depends on the application, researchers use different models to define clustering problems. Heuristic clustering algorithms are an efficient way to deal with clustering problems defined by combinatorial optimization models, but initialization sensitivity is an inevitable problem. In the past decades, many methods have been proposed to deal with this problem. In this paper, on the contrary, we take advantage of initialization sensitivity to design a new clustering algorithm. First, we run K-means, a widely used heuristic clustering algorithm, on the data set multiple times to generate several clustering results; second, we propose a structure named Local Accumulative Knowledge (LAKE) to capture the common information of the clustering results; third, we execute the Single-linkage algorithm on LAKE to generate a rough clustering result; finally, we assign the remaining data objects to the corresponding clusters. Experimental results on synthetic and real-world data sets demonstrate the superiority of the proposed approach in terms of clustering quality measures.

Index Terms—Clustering, Local accumulative knowledge, Heuristic algorithm

I. INTRODUCTION

Clustering is a useful approach in data mining processes for identifying patterns and revealing underlying knowledge from large data collections.
The application areas of clustering include image segmentation, information retrieval, document classification, association rule mining, web usage tracking, and transaction analysis. Since the technique is needed in so many settings and the inductive procedure follows a variety of principles, clustering remains an active research topic in various areas [1].

Because clustering is closely tied to its applications, researchers use different models to define the clustering problem and propose different ways to deal with these models. In this paper, we focus on the clustering problem defined by a combinatorial optimization model, described as follows. Given an input data set $D = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in R^d$, $i = 1, \ldots, N$, a clustering algorithm seeks a partition of $D$ into $K$ clusters, $C = \{C_1, C_2, \ldots, C_K\}$ ($K \le N$), such that the quality measure function

$$Q(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k,\, x_j \in C_k} dist(x_i, x_j)$$

is minimized, where $dist(\cdot,\cdot)$ is the distance function between data objects. Drineas et al. have proved that this problem is NP-hard [2]. Clustering algorithms that use the traverse method, such as PAM [3], cannot deal with large data sets. To deal with this kind of clustering problem, researchers have introduced local search methods and devised many heuristic clustering algorithms.

Figure 1 gives an example of a local search method. As Figure 1 shows, a local search starting from Init 1 converges to Solution 1, whereas a local search starting from Init 2 converges to a different Solution 2. This example shows that local search is sensitive to initialization; that is, different starting points can converge to different results. Because this is an inherent flaw of local search, initialization sensitivity is an inevitable problem of heuristic clustering algorithms.
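The quality measure and the initialization sensitivity described above can be illustrated with a small sketch. It is not the authors' implementation: the function names `quality` and `kmeans`, the four toy points, and the two sets of starting centers are all assumptions chosen for illustration. The K-means routine is a plain Lloyd-style local search that reduces squared distance to centroids, so it serves here only as a stand-in heuristic for the objective $Q(C)$, which is evaluated afterwards as the sum over ordered pairs of objects in the same cluster.

```python
from math import dist  # Euclidean distance, Python 3.8+
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def quality(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Q(C): sum of dist(x_i, x_j) over all ordered pairs sharing a cluster."""
    total = 0.0
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if labels[i] == labels[j]:
                total += dist(xi, xj)
    return total


def kmeans(points: Sequence[Point], init_centers: Sequence[Point],
           max_iter: int = 100) -> List[int]:
    """Lloyd-style local search started from the given (assumed) centers."""
    centers = [tuple(c) for c in init_centers]
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # no change, so a local optimum is reached
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Hypothetical toy data: the corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Two different initializations of the same local search (K = 2).
labels_a = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])  # splits left / right
labels_b = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])  # splits bottom / top

print(labels_a, quality(points, labels_a))
print(labels_b, quality(points, labels_b))
```

On this toy data both runs stop at a fixed point of the local search, yet they return different partitions with different values of the quality measure, which mirrors the Init 1 / Init 2 situation in Figure 1.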
In this paper, we make use of the initialization sensitivity of heuristic clustering algorithms to develop a novel clustering algorithm. The idea behind the proposed approach is to treat each result of the clustering algorithm as a piece of knowledge learned from the data set, and then to accumulate the results of multiple clustering executions with different initializations into the final optimal clustering solution. The major contributions of this paper are as follows:
• We define a structure called Local Accumulative KnowlEdge (LAKE) to capture common parts of the clustering results.
• We propose a Fast algorithm to find out LAKE
JOURNAL OF COMPUTERS, VOL. 8, NO. 2, FEBRUARY 2013 365 © 2013 ACADEMY PUBLISHER doi:10.4304/jcp.8.2.365-371