Proceedings of the 12th INDIACom; INDIACom-2018; IEEE Conference ID: 42835
2018 5th International Conference on “Computing for Sustainable Global Development”, 14th - 16th March, 2018
Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)
Fine-tuning of MapReduce jobs using parallel K-Means
clustering
Suyash Mishra
Computer Sciences
Noida International University
Greater Noida, India
ssuyashmishra@gmail.com
Dr. Anuranjan Misra
Computer Sciences
Noida International University
Greater Noida, India
Amc290@gmail.com
Dr. Suryakant Yadav
Computer Sciences
Noida International University
Greater Noida, India
Suryakantyadav11@gmail.com
Abstract— The advancement of information technology has led to enormous growth in data, which poses major challenges for data storage and for the analysis needed to draw meaningful information from it. With data sizes in the petabyte and exabyte range, mining information from such large data sets demands greater computing capacity. Thus, to address and manage this high-velocity data growth, advanced processing algorithms and methods are required for data analysis. To this end, a MapReduce word count program and P-KMeans, a parallel clustering algorithm, are used on Hadoop. Experiments show that execution time decreases as the number of nodes in a cluster increases; several other important observations were also made while conducting the experiments. Performance changes were measured and the results plotted on performance charts. This paper studies MapReduce applications, along with the verification and improvement of the performance recorded for the parallel K-Means algorithm on a four-node Hadoop cluster.
Keywords— Hadoop, Big Data, Data Clustering, K-Means algorithm, Parallel Computing, MapReduce, word count, HDFS.
I. INTRODUCTION
Traditional data storage and processing capabilities are limited and were reliant on the available infrastructure; the storage and processing requirements of that time were very different from today's. Thus, those approaches and databases face severe technical challenges in accommodating fast-growing Big Data storage and processing demands.
Current rapid advancement and ongoing progress in social media, robotics, the web, healthcare, mobile devices and research are producing data that grows exponentially, and processing such data has become a huge challenge. How to accommodate and handle these enormous collections of data, and further extract meaningful information from them, has become a pressing problem.
Business decisions are the result of detailed analysis of a vast variety of big data related to the industry or organization. Therefore, deriving meaningful insight from data requires efficient data analysis methods.
In a distributed environment, data mining unearths data relationships by applying best practices from artificial intelligence, relational databases and related fields. Besides analytics, it also helps in data management and data modeling. Distributed computing is a methodology that resolves the data processing and storage challenge by sharing the CPU processing of networked systems. Every independent machine on the network is known as a node, and a cluster is a collection of nodes in a network. Apache Hadoop
[1] is a distributed open source framework that provides parallel, fault-tolerant and scalable processing capability for big data. It grew out of Google's MapReduce and Google File System designs and was later adopted by Apache. It is suitable for applications that produce a huge volume and variety of data and need quick results for data analysis and decision-making, harnessing Hadoop MapReduce's ability to execute in parallel on a distributed environment. It follows the Map and Reduce programming methodology: the input data is split, and each data partition is stored on various nodes of the distributed network for parallel processing. Interim output chunks are then combined into the result. Hadoop also provides a file system, HDFS, for the storage and organization of data. Hadoop's fault tolerance automatically takes care of node failures, re-running failed tasks on idle or underutilized nodes. In an era of high demand to mine fast-growing big data, fast and efficient information retrieval is required on Hadoop's distributed environment. This not only minimizes the execution time of the assigned task but also spares individual machines from processing huge amounts of data. The Google File System [4] and the MapReduce concept gave the power to process big data without worrying about the scale of the data analytics to be done, by providing very convenient processing.
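The split/map/combine flow described above can be illustrated with a minimal, single-process word count sketch in Python (a hypothetical stand-in for a real Hadoop MapReduce job, which would be written against the Hadoop API; the function names here are illustrative only):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The input is split into chunks, much as HDFS splits files into blocks.
splits = ["big data needs big compute", "data grows and grows"]

# Each split is mapped independently (on different nodes in Hadoop);
# the interim (word, 1) pairs are then merged and reduced into the result.
interim = [pair for chunk in splits for pair in map_phase(chunk)]
result = reduce_phase(interim)
print(result["big"], result["data"], result["grows"])  # 2 2 2
```

In a real cluster, the framework shuffles the interim pairs so that all counts for the same word reach the same reducer; here the merge step plays that role.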
This has enabled great data exploration and processing across both structured and unstructured data. There is dedicated research, and specialized plug-in software is available, e.g. Hive [5], Zookeeper [6] and Pig [7], which are
Copy Right © INDIACom-2018; ISSN 0973-7529; ISBN 978-93-80544-28-1 552