A Clustering Algorithm based on Local Accumulative Knowledge
Yu Zong (1,2), Ping Jin (1,*)
(1) West Anhui University, Luan, China
(2) University of Science and Technology of China, Hefei, China
Email: Nick.zongy@gmail.com, Jinping@wxc.edu.cn
Guandong Xu
Victoria University, Melbourne, Australia
Email: Guandongxu@vu.edu.cn
Rong Pan
Aalborg University, DK-9220, Denmark
Abstract—Clustering, an important unsupervised learning technique, is widely used to discover the inherent structure of a given data set. Because clustering depends on the application, researchers use different models to define clustering problems. Heuristic clustering algorithms are an efficient way to deal with clustering problems defined by combinatorial optimization models, but initialization sensitivity is an inevitable problem. Over the past decades, many methods have been proposed to deal with this problem. In this paper, by contrast, we take advantage of initialization sensitivity to design a new clustering algorithm. First, we run K-means, a widely used heuristic clustering algorithm, on the data set multiple times to generate several clustering results; second, we propose a structure named Local Accumulative Knowledge (LAKE) to capture the information common to these clustering results; third, we execute the Single-linkage algorithm on LAKE to generate a rough clustering result; finally, we assign the remaining data objects to the corresponding clusters. Experimental results on synthetic and real-world data sets demonstrate the superiority of the proposed approach in terms of clustering quality measures.
Index Terms—Clustering, Local accumulative knowledge, Heuristic algorithm
I. INTRODUCTION
Clustering is a useful approach in data mining for identifying patterns and revealing underlying knowledge in large data collections. Its application areas include image segmentation, information retrieval, document classification, association rule mining, web usage tracking, and transaction analysis. Since the technique is needed in so many settings and the inductive procedure follows a variety of principles, clustering remains an active research topic in various areas [1]. Because clustering is closely tied to applications, researchers use different models to define the clustering problem and propose different ways to deal with those models. In this paper, we focus on the clustering problem defined by a combinatorial optimization model, described as follows:
Given an input data set $D = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in R^d$, $i = 1, \ldots, N$, a clustering algorithm attempts to seek K partitions of $D$, $C = \{C_1, C_2, \ldots, C_K\}$ ($K \le N$), such that the quality measure function
$$Q(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k,\, x_j \in C_k} dist(x_i, x_j)$$
is minimized, where $dist(\cdot,\cdot)$ is the distance function between
data objects. Drineas et al. have proved that this problem is NP-hard [2]. Traversal-based clustering algorithms, such as PAM [3], cannot handle large data sets. To deal with this kind of clustering problem, researchers have introduced local search methods and devised many heuristic clustering algorithms. Figure 1 gives an example of a local search method. In Figure 1, a local search starting from Init 1 converges to Solution 1, whereas a local search starting from Init 2 converges to a different Solution 2. This example shows that local search is sensitive to initialization: different starting points can converge to different results. Because this is an inherent weakness of local search, initialization sensitivity is an inevitable problem for heuristic clustering algorithms.
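The quality function and the effect of initialization can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four points and the two sets of initial centers are hypothetical and only play the role of Init 1 and Init 2 in Figure 1, and a Lloyd-style K-means iteration stands in for the local search. Note that this iteration minimizes squared distances to cluster means rather than Q(C) directly, and that `quality` counts every ordered pair, as the double sum is written. Python 3.8 or later is assumed for `math.dist`.

```python
import math

def quality(points, labels):
    """Q(C): sum of dist(x_i, x_j) over all ordered pairs sharing a cluster."""
    return sum(
        math.dist(p, q)
        for p, lp in zip(points, labels)
        for q, lq in zip(points, labels)
        if lp == lq
    )

def local_search(points, init_centers, max_iter=100):
    """Lloyd-style iterations: assign to nearest center, then recompute means."""
    centers = list(init_centers)  # copy so the caller's list is untouched
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Hypothetical data: the corners of a 4 x 1 rectangle, K = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels_1 = local_search(points, [(0.0, 0.0), (4.0, 0.0)])  # "Init 1"
labels_2 = local_search(points, [(2.0, 0.0), (2.0, 1.0)])  # "Init 2"
print(labels_1, quality(points, labels_1))  # [0, 0, 1, 1] 4.0
print(labels_2, quality(points, labels_2))  # [0, 1, 0, 1] 16.0
```

Both runs stop at a stable assignment, yet the second one has a quality value four times larger, which is the initialization sensitivity described above.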
In this paper, we make use of the initialization sensitivity of heuristic clustering algorithms to develop a novel clustering algorithm. The idea behind the proposed approach is to treat the result of each clustering run as a piece of knowledge learned from the data set, and then to accumulate the results of multiple clustering executions with different initializations into the final optimal clustering solution.
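One simple way to picture this accumulation is a pairwise co-association count: for every pair of objects, count how many runs placed them in the same cluster. The sketch below is an assumption made for illustration only; the LAKE structure itself is not defined in this section, and the three label vectors are hypothetical outputs of three clustering runs, not results from the paper.

```python
from itertools import combinations

def co_association(label_runs):
    """For every pair of objects, count the runs that put them in one cluster."""
    n = len(label_runs[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in label_runs:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return counts

def stable_pairs(label_runs):
    """Pairs on which every run agreed (count equals the number of runs)."""
    counts = co_association(label_runs)
    return sorted(pair for pair, c in counts.items() if c == len(label_runs))

# Hypothetical labels from three runs over five objects. The third run is the
# first partition with the cluster ids swapped, which does not affect counts.
runs = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]
counts = co_association(runs)
print(counts[(2, 3)])      # 2
print(stable_pairs(runs))  # [(0, 1), (3, 4)]
```

Objects 0 and 1, and objects 3 and 4, are grouped together in every run, while object 2 moves between clusters; pairs of the first kind are the sort of common information that repeated runs can expose.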
The major contributions of this paper are as follows:
• We define a structure called Local Accumulative
KnowlEdge (LAKE) to capture common parts of
the clustering results.
• We propose a Fast algorithm to find out LAKE
JOURNAL OF COMPUTERS, VOL. 8, NO. 2, FEBRUARY 2013 365
© 2013 ACADEMY PUBLISHER
doi:10.4304/jcp.8.2.365-371