ISSN: 2321-7782 (Online)
Volume 3, Issue 11, November 2015
International Journal of Advance Research in Computer Science and Management Studies
Research Article / Survey Paper / Case Study
Available online at: www.ijarcsms.com
© 2015, IJARCSMS All Rights Reserved

Parallel Two Phase K-Means Based On MapReduce

Bharath Kumar Gowru¹
Assistant Professor, Dept. of CSE, Amrita Sai Institute of Science & Technology, India

Pavani Potnuri²
Assistant Professor, Dept. of CSE, Amrita Sai Institute of Science & Technology, India

Abstract: Clustering is the process of grouping a collection of abstract objects into classes of related objects, such that objects in the same class are more closely related to each other than to those in other classes. Clustering is one of the efficient techniques in data mining for statistical data analysis in a number of domains, for example data retrieval. As technology advances, the volume of user data keeps increasing, and it can reach terabytes or more. Clustering such large amounts of data is therefore becoming complicated. As a solution to this problem, a new clustering algorithm is proposed in the Hadoop framework to group large volumes of data. The Hadoop framework provides different strategies for storing large data efficiently: using Hadoop, the data can be stored across many systems, whether they are located in the same place or in different places. The programming model of the Hadoop framework is MapReduce, in which two phases, map and reduce, perform distributed computations efficiently on large volumes of data. The proposed algorithm is implemented in the Hadoop framework following the MapReduce programming model.

Keywords: Data Clustering, K-means, Parallel Distributed Computing and MapReduce.

I. INTRODUCTION

Data analysis is defined as a process of cleaning, inspecting, transforming, and modeling data with the aim of extracting useful information [1].
Based on the analysis results, we can more easily make decisions and draw conclusions in data processing. Data analysis has several approaches in different domains. Data mining is one data analysis technique, and it focuses mainly on data modeling and information discovery. Data mining is the process of finding patterns in large data sets, which may be gathered from different repositories such as database systems. The main objective of the data mining process is to extract useful information from the data sets and transform the analysis results into an understandable structure for further use. Data mining provides six classes of data analysis techniques: anomaly detection, association rule mining, clustering, data classification, regression [2], and data summarization.

Clustering is the process of grouping a collection of abstract objects into classes of related objects, such that objects in the same class are more closely related to each other than to those in other classes. Clustering is one of the efficient techniques in data mining for statistical data analysis in a number of domains, for example data retrieval. Among data mining techniques, K-means is a popular clustering algorithm for cluster analysis. K-means clustering groups n observations into k clusters, where each observation is assigned to the cluster with the nearest mean.

The data to be clustered may be in a structured or an unstructured format. With the development of information technology, data volumes are growing day by day towards terabytes or more, and clustering such large amounts of data [3] is now a complex task. For structured data, most applications use a relational database to store the data. In such cases, superior machines are required to run bigger databases that can process large amounts of structured data, which is not cost-effective [4].
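The K-means rule described above (assign each observation to the cluster with the nearest mean, then recompute the means) can be illustrated with a minimal sketch. It is written in a map/reduce style but runs in a single Python process; it is not the two-phase algorithm proposed in this paper and does not use Hadoop. The function names (`nearest_center`, `map_phase`, `reduce_phase`, `kmeans`), the use of Euclidean distance, and the stopping rule are assumptions made for illustration only.

```python
from collections import defaultdict
from math import dist


def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))


def map_phase(points, centers):
    """Map step: emit a (cluster index, point) pair for every observation."""
    for point in points:
        yield nearest_center(point, centers), point


def reduce_phase(pairs, centers):
    """Reduce step: average the points of each cluster to get its new mean."""
    groups = defaultdict(list)
    for index, point in pairs:
        groups[index].append(point)
    new_centers = list(centers)  # a center with no points keeps its position
    for index, members in groups.items():
        new_centers[index] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return new_centers


def kmeans(points, centers, max_iterations=100):
    """Repeat the map and reduce steps until the centers stop moving."""
    for _ in range(max_iterations):
        updated = reduce_phase(map_phase(points, centers), centers)
        if updated == centers:
            break
        centers = updated
    return centers


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of 2-D points.
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 0.0)]))
```

In a real MapReduce job, the map step would run in parallel over partitions of the data and the reduce step would receive the pairs grouped by cluster index; the sketch only shows how the computation splits into those two steps.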