A MapReduce Framework to Implement Enhanced K-means Algorithm

Mr. Bhimasen V. Purohit
PG Student, Dept. of Computer Science & Engineering, R.V. College of Engineering, Bengaluru, India
bhimasen.v.purohit@gmail.com

Prof. Rajashree Shettar
Professor, Dept. of Computer Science and Engineering, R.V. College of Engineering, Bengaluru, India
rajashreeshettar@rvce.edu.in

Abstract— Data clustering is an important part of big data analytics. It helps to categorize data, which in turn helps to recognize hidden patterns. K-means is one such clustering algorithm, well known for its simple computation and its suitability for parallel execution. Big data analytics requires distributed computing, which can be achieved using the MapReduce technique. In this paper, an enhanced K-means algorithm is implemented using the MapReduce technique that comes with the Hadoop platform. The enhanced K-means algorithm is more efficient than the traditional K-means algorithm because it selects the initial cluster centroids by averaging the data points, rather than selecting them at random as the traditional algorithm does. The enhanced K-means algorithm also achieves better accuracy in cluster formation than traditional K-means.

Keywords— Hadoop, MapReduce, Data Clustering, K-means

I. INTRODUCTION

The digitization of information content has led to the generation of enormous amounts of data in every field. The data generated in organizations such as Facebook and Google reaches petabytes and exabytes per day. Processing this huge amount of data using traditional data mining techniques is infeasible. Hence, to process large amounts of data, parallel computation becomes necessary [1]. Data clustering is a data mining technique in which data points are grouped into clusters such that the distance between points in the same cluster is small and the distance between points in different clusters is large.
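
One common way to make this notion of clustering precise is the within-cluster sum of squared distances, which is the objective that standard K-means minimizes. The notation below (clusters $C_1, \dots, C_k$, centroids $\mu_j$, data points $x_i$) is introduced here for illustration and is not taken from the original text.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

A small value of $J$ means that points lie close to the centroid of their own cluster. Because K-means only finds a local minimum of $J$, the result depends on where the centroids start.
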
Traditional data mining has many clustering algorithms, such as partitioning methods, hierarchical methods, and density-based techniques. Many researchers are working on parallelizing these clustering algorithms to deal with big data [2]. MapReduce [3] is a parallel programming model that comes with the Hadoop environment and can be used to process large data sets. MapReduce consists of two phases, namely the Map phase and the Reduce phase. Users write their programs as Map and Reduce functions, and the Hadoop environment takes care of parallel execution.

II. RELATED WORK

Though the k-means algorithm is simple and efficient, its performance degrades as the number of clusters and the data size increase. To avoid this, multi-restart k-means can be used. However, this approach needs more and more starting points, which again becomes a bottleneck when the data is large [4]. Many variants have been proposed to improve the performance of the k-means algorithm. The methods proposed for incremental datasets are noteworthy [5]; in these methods, clusters are generated incrementally. Global k-means and a fast version of global k-means [6] have been proposed, which seek a global minimum of the clustering error. Along with this, a modified version of global k-means is proposed by Adil M. Bagirov et al. [7]. Two types of modification are available. Lai et al. proposed a fast version of the global k-means algorithm intended to reduce time complexity [8]. The modification proposed by Adil M. Bagirov et al. is aimed at reducing space complexity, i.e., using less memory. These variants of K-means obtain better results than the traditional algorithm. K-means can also be enhanced by techniques for choosing the initial centroids. An intelligent method of selecting centroids is discussed by Anand M. Banswade et al. [9].
The first centroid is selected by averaging the data points. The remaining centroids are calculated by finding the points that are far from the first calculated centroid. The experimental results in that paper showed better accuracy. Azhar Rauf et al. [10] proposed another enhancement to the K-means algorithm in which execution is divided into two phases. In the first phase, the input data array is divided into smaller sub-arrays that represent clusters; the output of this phase forms the initial centroids. In the second phase, the cluster sizes vary, and the final clusters are formed after iterative computation. The basic K-means algorithm has also been implemented in a MapReduce manner [11], where distances are computed in the Map phase and the results are combined in the Reduce phase. However, all of these implementations adopt the traditional random selection of initial centroids. Hence, in this work we analyze whether the proposed method of selecting initial centroids gives a better outcome.

978-1-4673-9223-5/15/$31.00 © 2015 IEEE
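
The centroid selection attributed to [9] and the Map/Reduce split attributed to [11] in the Related Work section can be sketched together as follows. This is a minimal single-process illustration in plain Python, not Hadoop code and not the authors' implementation. The function names are hypothetical, and taking the k-1 points farthest from the mean as the remaining centroids is only one possible reading of "points which are far from the first calculated centroid".

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def initial_centroids(points, k):
    """Assumed reading of [9]: the first centroid is the mean of all points,
    the remaining k-1 are the points farthest from that mean."""
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    farthest = sorted(points, key=lambda p: dist(p, mean), reverse=True)
    return [mean] + [tuple(p) for p in farthest[:k - 1]]


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every point."""
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        yield nearest, p


def reduce_phase(pairs, centroids):
    """Reduce: average the points sharing a key to obtain new centroids."""
    groups = defaultdict(list)
    for key, p in pairs:
        groups[key].append(p)
    updated = list(centroids)  # a centroid with no points keeps its position
    for key, members in groups.items():
        updated[key] = tuple(sum(c) / len(members) for c in zip(*members))
    return updated


def kmeans(points, k, max_iter=100):
    """Repeat map and reduce until the centroids stop changing."""
    centroids = initial_centroids(points, k)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

Calling `kmeans(points, k)` on a list of equal-length numeric tuples returns k centroids. In an actual Hadoop job each iteration would be a separate MapReduce job, with the current centroids distributed to every mapper; that part is omitted here.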