© 2015, IJARCSMS All Rights Reserved 22 | Page
ISSN: 2321-7782 (Online)
International Journal of Advance Research in
Computer Science and Management Studies
Volume 3, Issue 11, November 2015
Research Article / Survey Paper / Case Study
Available online at: www.ijarcsms.com
Parallel Two Phase K-Means Based On Mapreduce
Bharath Kumar Gowru¹
Assistant Professor,
Dept of CSE, Amrita Sai Institute of Science & Technology
India
Pavani Potnuri²
Assistant Professor,
Dept of CSE, Amrita Sai Institute of Science & Technology
India
Abstract: Clustering is the process of grouping a collection of abstract objects into classes such that objects in the
same class are more closely related to each other than to objects in other classes. Clustering is one of the efficient
techniques in data mining for static data analysis in a number of domains, for example data retrieval. As technology
advances, the volume of user data keeps increasing, sometimes reaching terabytes or more, so clustering such large
amounts of data is becoming difficult. As a solution to this problem, a new clustering algorithm is proposed in the Hadoop
framework to group large volumes of data. Hadoop provides different strategies for storing large data efficiently, and the
data can be stored across many systems located in the same place or in different places. The programming model of
Hadoop is MapReduce, in which two phases, map and reduce, perform distributed computations efficiently on large
volumes of data. The proposed algorithm is implemented in the Hadoop framework following the MapReduce
programming model.
Keywords: Data Clustering, K-means, Parallel Distributed Computing and MapReduce.
I. INTRODUCTION
Data analysis is the process of cleaning, inspecting, transforming, and modeling data with the aim of extracting useful
information [1]. Based on the analysis results, decisions and conclusions can be made more easily in data processing. Data
analysis has several approaches in different domains. Data mining is one of the data analysis techniques; it focuses mainly
on data modeling and information discovery.
Data mining is the process of finding patterns in large data sets. Data sets may be gathered from different repositories
such as database systems. The main objective of the data mining process is to extract useful information from the data sets
and transform the analysis results into an understandable structure for further use. Data mining provides six classes of data
analysis techniques: anomaly detection, association rule mining, clustering, classification, regression [2], and
summarization. Clustering is the process of grouping a collection of abstract objects into classes such that objects in the
same class are more closely related to each other than to objects in other classes. Clustering is one of the efficient
techniques in data mining for static data analysis in a number of domains, for example data retrieval.
Among data mining techniques, K-means is a popular algorithm for cluster analysis. K-means partitions n observations
into k clusters, where each observation is assigned to the cluster with the nearest mean. The data to be clustered may be in
structured or unstructured format. With the development of information technology, data volumes are growing day by day
toward terabytes or more, and clustering such large amounts of data [3] is now a complex task. When analyzing structured
data, most applications use a relational database to store the data. In such cases, more powerful machines are required to
run larger databases and process large amounts of structured data, which is not cost-effective [4].
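The K-means rule described above (assign each observation to the cluster with the nearest mean, then recompute the means) can be illustrated with a minimal single-machine sketch in Python. This is the standard sequential K-means iteration, not the parallel two-phase algorithm proposed in this paper. The function names nearest_center, kmeans_step, and kmeans, the use of Euclidean distance, and the choice to leave an empty cluster's center unchanged are assumptions made for illustration only.

```python
import math


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans_step(points, centers):
    """Run one iteration: assign points to nearest centers, then recompute means."""
    clusters = [[] for _ in centers]
    for p in points:
        clusters[nearest_center(p, centers)].append(p)

    new_centers = []
    for old, members in zip(centers, clusters):
        if members:
            dim = len(members[0])
            mean = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            new_centers.append(mean)
        else:
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(old)
    return new_centers, clusters


def kmeans(points, centers, max_iter=100):
    """Repeat kmeans_step until the centers stop changing or max_iter is reached."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        new_centers, clusters = kmeans_step(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    final_centers, final_clusters = kmeans(data, [(0, 0), (10, 10)])
    print(final_centers)
    print(final_clusters)
```

Every iteration scans all n observations against all k centers, so the cost of the assignment step grows with the data volume; this is the part that becomes hard to run on one machine at the scales discussed above.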