VOL. 11, NO. 2, JANUARY 2016 ISSN 1819-6608 ARPN Journal of Engineering and Applied Sciences © 2006-2016 Asian Research Publishing Network (ARPN). All rights reserved. www.arpnjournals.com 1086

INITIALIZATION OF OPTIMIZED K-MEANS CENTROIDS USING DIVIDE-AND-CONQUER METHOD
J. James Manoharan and S. Hari Ganesh
Department of Computer Applications, Bishop Heber College, Tiruchirappalli, India
E-Mail: james_7676@yahoo.com

ABSTRACT
The K-Means clustering algorithm is one of the most popular unsupervised learning algorithms and is widely used to cluster data items; it is among the most commonly used clustering methods in data mining. Many clustering algorithms have been built on K-Means because of its simplicity and efficiency. The final result of the K-Means clustering algorithm depends heavily on the initial centroids, which are selected at random by the user. The difficulty of determining “the right number of clusters” in traditional K-Means clustering has attracted significant attention, especially in recent years. Many improvements have been developed to obtain better performance from K-Means, but most of these methods require additional inputs, such as threshold values for the number of data points in a data set. In this work, the proposed algorithm addresses the problems of finding initial centroids and assigning data items to the proper clusters by using a divide-and-conquer method. In the proposed method, the initial cluster centers are obtained using the divide-and-conquer approach, after which the K-Means algorithm is applied to obtain the optimal cluster centers in the data set. The proposed algorithm improves the execution speed of clustering by requiring only a small number of iterations. With the help of mathematical calculations, the proposed algorithm reduces the complexity encountered in the K-Means clustering algorithm.

Keywords: K-means clustering, centroids, divide-and-conquer.
INTRODUCTION
Owing to the increased availability of computer hardware and software and the rapid computerization of business, huge amounts of data have been collected and stored in databases. Researchers have estimated that the amount of information in the world doubles every 20 months. However, raw data cannot be used directly; its actual value is realized by extracting information useful for decision support. In most areas, data analysis was conventionally a manual procedure. When the scale of data manipulation and exploration goes beyond human capabilities, people look to computing technologies to automate the process. Data mining is one of the youngest research areas in computing science and is defined as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining is applied to obtain useful information from bulk data, and researchers have provided a number of tools and techniques for extracting patterns from data. Clustering is the method of organizing data objects into a set of disjoint classes called clusters. Large amounts of data are collected every day in many areas of business and science [7]. These data need to be analyzed in order to find interesting information, and one of the most important analysis methods is data clustering. The simple K-Means clustering algorithm is a popular data clustering algorithm. It is simple to implement, fast and sensitive [10]. However, the K-Means algorithm has some drawbacks, such as the selection of initial centroids, the number of iterations needed to find the clusters, and the creation of empty clusters [4]. To overcome the drawbacks of the traditional K-Means clustering algorithm, a great deal of work has been done by various researchers. In real-life clustering problems it is quite difficult to choose the number of clusters present in the final result [2].
A large number of procedures have been developed to determine the number of clusters present in a data set. Predicting the appropriate number of clusters for a given data set is generally a trial-and-error process, made more difficult by the subjective nature of deciding what constitutes a perfect clustering. In this paper, a novel method is proposed to address the initialization problem of the K-Means algorithm, because the convergence result of the K-Means algorithm is highly dependent on the initial centroids [8]. If the initial centroids are not chosen appropriately, traditional K-Means clustering can become trapped in a local optimum [5]. The better the initial centroids, the better the convergence result. The proposed method therefore addresses both the initialization and the local optimum issues of traditional K-Means clustering [1].

TRADITIONAL K-MEANS CLUSTERING ALGORITHM
The K-Means clustering algorithm is a partition-based cluster analysis technique. The algorithm first randomly selects k objects as the initial centroids. It then calculates the distance between each data object and each cluster centre, assigns each data object to the nearest cluster, and calculates the new centroids. This procedure is repeated until the criterion function converges. The algorithm aims to minimize an objective function known as the squared error function, given by