Parallel K-Means Clustering Based on MapReduce

Weizhong Zhao 1,2, Huifang Ma 1,2, and Qing He 1

1 The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences
2 Graduate University of Chinese Academy of Sciences
{zhaowz,mahf,heq}@ics.ict.ac.cn

Abstract. Data clustering has received considerable attention in many applications, such as data mining, document retrieval, image segmentation and pattern classification. The growing volumes of information brought about by technological progress make the clustering of very large-scale data a challenging task. To deal with this problem, many researchers have tried to design efficient parallel clustering algorithms. In this paper, we propose a parallel k-means clustering algorithm based on MapReduce, which is a simple yet powerful parallel programming technique. The experimental results demonstrate that the proposed algorithm scales well and efficiently processes large datasets on commodity hardware.

Keywords: Data mining; Parallel clustering; K-means; Hadoop; MapReduce.

1 Introduction

With the development of information technology, the data volumes processed by many applications will routinely cross the peta-scale threshold, which will in turn increase the computational requirements. Efficient parallel clustering algorithms and implementation techniques are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. So far, several researchers have proposed parallel clustering algorithms [1,2,3]. All of these parallel clustering algorithms share the following drawbacks: a) they assume that all objects can reside in main memory at the same time; b) their parallel systems provide restricted programming models and use those restrictions to parallelize the computation automatically. Both assumptions are prohibitive for very large datasets with millions of objects.
Therefore, dataset-oriented parallel clustering algorithms should be developed. MapReduce [4,5,6,7] is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.

M.G. Jaatun, G. Zhao, and C. Rong (Eds.): CloudCom 2009, LNCS 5931, pp. 674–679, 2009. © Springer-Verlag Berlin Heidelberg 2009
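To illustrate the map/reduce decomposition described above as applied to k-means, one iteration of the algorithm can be split into a map phase (assign each point to its nearest center) and a reduce phase (recompute each center as the mean of its assigned points). The following is a minimal sequential sketch of that decomposition, not the paper's Hadoop implementation; the function names (kmeans_map, kmeans_reduce, kmeans_iteration) are invented here for illustration.

```python
from collections import defaultdict

def kmeans_map(point, centers):
    """Map: emit (index of nearest center, point) for one input point."""
    nearest = min(range(len(centers)),
                  key=lambda i: sum((p - c) ** 2
                                    for p, c in zip(point, centers[i])))
    return nearest, point

def kmeans_reduce(points):
    """Reduce: the new center is the component-wise mean of the points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def kmeans_iteration(points, centers):
    """Run one map/shuffle/reduce round; returns {cluster_id: new_center}."""
    groups = defaultdict(list)        # stands in for the shuffle phase
    for point in points:
        cluster_id, value = kmeans_map(point, centers)
        groups[cluster_id].append(value)
    return {cid: kmeans_reduce(pts) for cid, pts in groups.items()}
```

In a real MapReduce job the map calls run in parallel over partitions of the dataset, the framework groups the emitted key/value pairs by cluster id, and the reduce calls run in parallel per cluster; the sketch above simulates that flow in a single process.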