VOL. 11, NO. 2, JANUARY 2016 ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
© 2006-2016 Asian Research Publishing Network (ARPN). All rights reserved.
www.arpnjournals.com
INITIALIZATION OF OPTIMIZED K-MEANS CENTROIDS USING
DIVIDE-AND-CONQUER METHOD
J. James Manoharan and S. Hari Ganesh
Department of Computer Applications, Bishop Heber College, Tiruchirappalli, India
E-Mail: james_7676@yahoo.com
ABSTRACT
The K-means clustering algorithm is one of the most popular unsupervised learning algorithms and is broadly used to
cluster data items. It is also one of the most commonly used clustering methods in data mining. A
number of algorithms have been developed for clustering data items using K-means because of its simplicity and efficiency.
The final clustering result of the K-means clustering algorithm depends heavily on the initial centroids, which are
selected at random by the user. The difficulty of determining “the right number of clusters” in traditional K-means
clustering has attracted significant attention, especially in recent years. Many improvements have already been
developed to obtain better performance from K-means, but most of these methods need additional inputs such as threshold values
for the number of data points in a data set. In this work, the proposed algorithm addresses the problems of finding initial
centroids and assigning data items to the proper clusters using a divide-and-conquer method. In the proposed method, the initial
cluster centers are obtained using the divide-and-conquer property, after which the K-means algorithm is applied to obtain optimal
cluster centers in the dataset. The proposed algorithm can improve the execution speed of clustering the data items by using a small
number of iterations. With the help of mathematical calculations, the proposed algorithm decreases the complexity
faced in the K-means clustering algorithm.
Keywords: K-means clustering, centroids, divide-and-conquer.
INTRODUCTION
Due to the increased availability of computer
hardware and software and the rapid computerization of
business, huge amounts of data have been collected and
stored in databases. Researchers have estimated that the
amount of information in the world doubles every 20
months. However, raw data cannot be used directly. Its
actual value is realized by extracting information useful
for decision support. In most areas, data analysis was
traditionally a manual procedure. When the scale of data
manipulation and exploration goes beyond human
capabilities, people look to computing technologies to
automate the process. Data mining is one of the
youngest research activities in the field of computing
science and is defined as the extraction of interesting
(non-trivial, implicit, previously unknown and potentially
useful) patterns or knowledge from huge amounts of data.
Data mining is applied to obtain useful information from
bulk data. Researchers have provided a number of data
mining tools and techniques for extracting patterns
from data.
Clustering is the method of organizing data
objects into a set of disjoint classes called clusters. Large
amounts of data are being collected every day in many
business and science areas [7]. This data needs to be
analyzed in order to find interesting information in it,
and one of the most important analysis methods is data
clustering. The simple K-means clustering algorithm is a
popular data clustering algorithm. It is simple to
implement, and it is fast and sensitive [10]. However, the
K-means algorithm has some drawbacks, such as the selection
of initial centroids, the number of iterations needed to find the
clusters, and the creation of empty clusters [4]. To overcome
the drawbacks of the traditional K-means clustering algorithm,
a lot of work has been done by various researchers. In
real-life clustering problems it is quite difficult to choose
the number of clusters present in the final result [2]. A large
number of procedures have been developed to determine
the number of clusters present in a dataset. Predicting the
appropriate number of clusters for a given
data set is generally a trial-and-error process, made more
difficult by the subjective nature of deciding what
constitutes perfect clustering. In this paper, a novel method
is proposed to address the initialization problem of the
K-means algorithm, because the convergence result of the
K-means algorithm is highly dependent on the initial
centroids [8]. If the initial centroids are not chosen
appropriately, the local optimum problem will arise
in traditional K-means clustering [5]. A good
convergence result depends directly on good initial
centroids. The proposed method therefore addresses both the
initialization and the local optimum issues of traditional
K-means clustering [1].
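The dependence on the initial centroids can be illustrated with a minimal Python sketch. It is not the method proposed in this paper; it is a plain K-means loop (assign each point to its nearest centroid, then recompute each centroid as the mean of its points) run on a hypothetical four-point dataset. The function names `squared_distance` and `kmeans`, the data, and the use of the sum of squared distances as the quality measure are assumptions made for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Plain K-means from user-supplied initial centroids (illustrative only).

    Returns the final centroids and the sum of squared distances from each
    point to the centroid of the cluster it was assigned to.
    """
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        # An empty cluster keeps its previous centroid (an assumption).
        new_centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    sse = sum(squared_distance(p, centroids[j])
              for j, members in enumerate(clusters) for p in members)
    return centroids, sse


# Hypothetical data: the four corners of a 4 x 1 rectangle, with k = 2.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Initial centroids on opposite short sides of the rectangle.
good_centroids, good_sse = kmeans(points, [(0, 0), (4, 0)])

# Initial centroids both on the same short side.
bad_centroids, bad_sse = kmeans(points, [(0, 0), (0, 1)])

print(good_centroids, good_sse)
print(bad_centroids, bad_sse)
```

With this data, the first initialization converges to a left/right partition with a squared error of 1.0, while the second stops at a top/bottom partition with a squared error of 16.0. The second result is a local optimum: no point changes cluster in the next iteration, so the loop terminates there even though a much better partition exists.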
TRADITIONAL K-MEANS CLUSTERING
ALGORITHM
The K-means clustering algorithm is a partition-based
cluster analysis technique. The algorithm first
randomly selects k objects as initial centroids, then
calculates the distance between each data object and each
cluster centre and assigns the data object to the nearest
cluster. It then calculates the new centroids and repeats this
procedure until the criterion function converges. Ultimately,
this algorithm aims at minimizing an objective function
known as the squared error function, given by