Proceedings of the 12th INDIACom; INDIACom-2018; IEEE Conference ID: 42835
2018 5th International Conference on “Computing for Sustainable Global Development”, 14th - 16th March, 2018
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Fine-tuning of MapReduce jobs using parallel K-Means
clustering
Suyash Mishra
Computer Sciences
Noida International University
Greater Noida, India
ssuyashmishra@gmail.com
Dr. Anuranjan Misra
Computer Sciences
Noida International University
Greater Noida, India
Amc290@gmail.com
Dr. Suryakant Yadav
Computer Sciences
Noida International University
Greater Noida, India
Suryakantyadav11@gmail.com
Abstract— The advancement of information technology has led to enormous growth in data, which poses major challenges for data storage and for the analysis needed to draw meaningful information from it. With data sizes in the petabyte and exabyte range, mining information from such large data sets demands greater computing capacity. Thus, to address and manage this high-velocity data growth, advanced processing algorithms and methods are required for data analysis. To this end, a MapReduce word count program and P-KMeans, a parallel clustering algorithm, are used on Hadoop. Experiments show that execution time decreases as the number of nodes in a cluster increases; several other important observations were also made while conducting the experiments. Performance changes were measured and the results plotted on performance charts. This paper studies MapReduce applications, along with the verification and improvement of the performance recorded for the parallel K-Means algorithm on a four-node Hadoop cluster.
Keywords— Hadoop, Big Data, Data Clustering, K-Means algorithm, Parallel Computing, MapReduce, word count, HDFS.
I. INTRODUCTION
Traditional data storage and processing capabilities are limited and were reliant on the available infrastructure; the storage and processing requirements of that time were very different from today's. Thus, those approaches and databases face severe technical challenges in accommodating fast-growing Big Data storage and processing demands.
Current rapid advancement and ongoing progress in social media, robotics, the web, healthcare, mobile devices and research are producing data that grows exponentially, and processing such data has become a huge challenge. How to accommodate and handle these enormous collections of data, and further extract meaningful information from them, has become a pressing problem.
Business decisions are the result of detailed analysis of a vast variety of big data related to the industry or organization. Therefore, deriving meaningful insight from data requires efficient data analysis methods.
In a distributed environment, data mining unearths data relationships by applying best practices from artificial intelligence, relational databases and related fields. Besides analytics, it also helps in data management and data modeling. Distributed computing is a methodology that resolves the data processing and storage challenge by sharing the CPU processing of networked systems. Every independent machine on the network is known as a node, and a cluster is a collection of nodes in a network. Apache Hadoop
[1] is a distributed open source framework that provides parallel, fault-tolerant and scalable processing capability for big data. It grew out of Google's MapReduce and Google File System designs and was later adopted by Apache. It is suitable for applications that produce a huge volume and variety of data and need quick results for data analysis and decision-making, harnessing Hadoop MapReduce's ability to execute in parallel on a distributed environment. It follows the Map and Reduce programming methodology: the input data is split, and each data partition is stored on various nodes of the distributed network for parallel processing. Interim output chunks are then combined into the result. Hadoop also provides a file system, HDFS, for the storage and organization of data. Hadoop's fault tolerance automatically takes care of node failures, re-running failed tasks on idle or underutilized nodes. In an era of high demand to mine fast-growing big data, fast and efficient information retrieval is required on Hadoop's distributed environment. This not only minimizes the execution time of the assigned task but also spares individual machines from processing huge amounts of data. The Google File System [4] and the MapReduce concept gave the power to process big data without worrying about the scale of the data analytics to be done, by providing very convenient processing.
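The split/map/combine flow described above can be illustrated with a minimal, single-process word count sketch in Python (a hypothetical stand-in for a real Hadoop MapReduce job, which would be written against the Hadoop API; the function names here are illustrative only):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input is split into chunks, much as HDFS splits files into blocks.
splits = ["big data needs big compute", "data grows and grows"]

# Each split is mapped independently (on different nodes in Hadoop);
# the interim (word, 1) pairs are then merged and reduced into the result.
interim = [pair for chunk in splits for pair in map_phase(chunk)]
result = reduce_phase(interim)
print(result["big"], result["data"], result["grows"])  # 2 2 2
```

In a real cluster, the framework shuffles the interim pairs so that all counts for the same word reach the same reducer; here the merge step plays that role.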
This has enabled great data exploration and processing across both structured and unstructured data. There is dedicated research, and specialized plug-in software is available, e.g. Hive [5], Zookeeper [6] and Pig [7], which are
Copy Right © INDIACom-2018; ISSN 0973-7529; ISBN 978-93-80544-28-1 552