Performance Improvement in MapReduce via Overlapping of Mapper and Reducer

Saurabh Gupta, Manish Pandey
Maulana Azad National Institute of Technology, Bhopal, 462003, India

Abstract - The MapReduce model supports big data processing on clusters through user-specified mapper and reducer functions. The user defines a mapper function that processes input key-value pairs and produces intermediate key-value pairs; the reducer function merges all the values for the same key and produces output key-value pairs. Hadoop is one of the most popular frameworks supporting the MapReduce model. In the traditional MapReduce model, Hadoop forces the reducer to start its execution only after all mappers have finished. This causes inefficient utilization of system resources and degrades performance. To overcome this limitation of traditional Hadoop, this article proposes two approaches that together address it. The first, overlapping of mapper and reducer, starts a reduce task as soon as a predefined number of map tasks have completed. The second, hierarchical reduction, introduces several stages of reduce tasks: when a reduce task has finished processing the data generated by its corresponding map tasks, another stage of reduce task is started. Combining both solutions, three algorithms, PageRank, Kmeans and WordCount, are implemented in this article. The experimental results show that speedups of 6.5%, 7.02% and 10.38% over traditional Hadoop are achieved for the WordCount, Kmeans and PageRank applications respectively.

Keywords - distributed computing, MapReduce, Hadoop, cloud computing.

I. INTRODUCTION

In day-to-day life, the amount of data being generated from numerous essential areas, including e-business, finance, hospitality data, CCTV cameras, education and the environment, is colossal.
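The mapper/reducer contract described above can be illustrated with a minimal, self-contained Python sketch. This is not the Hadoop API; all names here are illustrative, and the shuffle phase is modelled as an in-memory grouping by key:

```python
from collections import defaultdict

def mapper(line):
    """User-defined map function: emit an intermediate (word, 1) pair
    for every word in one input record."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    """User-defined reduce function: merge all values observed
    for the same intermediate key."""
    return (key, sum(values))

def map_reduce(lines):
    # Map phase: apply the mapper to every input record.
    intermediate = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            intermediate[key].append(value)
    # The shuffle is modelled by the grouping above;
    # the reduce phase then merges the values per key.
    return dict(reducer(k, v) for k, v in intermediate.items())

counts = map_reduce(["big data needs big clusters", "data data data"])
# counts["big"] == 2, counts["data"] == 4
```

In a real Hadoop job the grouping step is a distributed shuffle between cluster nodes, which is exactly the phase whose scheduling relative to the map tasks this article's overlapping approach targets.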
To obtain important and relevant information, this bulk of data must be processed in order to make business decisions or improve end-user services. In recent years, a large number of computing frameworks [1], [2], [3], [4], [5], [6], [7], [8], [9], [10] have been developed for big data processing. Among these frameworks, MapReduce [1] (Hadoop) is the most extensively used because of its simplicity and scalability. The Hadoop framework is suitable for a variety of algorithms, including large-scale image processing [11], relational query evaluation [12] and web-scale document analysis [13]. Beyond these domains, there are many applications, e.g. PageRank [14], internet traffic analysis [15], social network analysis [16], neural network analysis [17], clustering [18], recursive relational queries [19] and Hypertext Induced Topic Search (HITS) [20], that require iterative calculations, so it is necessary to process them efficiently. Hadoop uses a distributed architecture and processes this huge data in a cluster environment [21]. The Hadoop framework is divided into two parts to handle this huge information: the Hadoop Distributed File System (HDFS) and MapReduce.

1) Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System is a fault-tolerant and scalable file system for the MapReduce framework, designed to run on commodity hardware. In this file system, input data is broken into small pieces known as blocks; the default block size is either 64 MB or 128 MB. For data availability, it also supports data replication. By

International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 7, July 2016 572 https://sites.google.com/site/ijcsis/ ISSN 1947-5500