Performance Optimization through Data Pipeline in Heterogeneous Hadoop Cluster

D C VINUTHA
Research Scholar, Dept. of CSE, RNS Institute of Technology, Bengaluru
Associate Professor, Dept. of ISE, Vidyavardhaka College of Engineering, Mysuru
Visvesvaraya Technological University, Belagavi, Karnataka
E-mail: vinuthadc@vvce.ac.in

G T RAJU
Professor, Dept. of CSE, RNS Institute of Technology, Bengaluru
Visvesvaraya Technological University, Belagavi, Karnataka

Abstract: Hadoop is a framework that implements MapReduce, a programming model for processing huge amounts of data in a distributed manner. The challenge is that the map phase cannot be initiated until the entire input data is stored in HDFS, and this introduces a delay. At present, to store a huge volume of data in HDFS, a synchronous pipeline is used to transfer data from the client to the data nodes. It transfers data block by block and waits for ACK packets from all the data nodes; the client is idle until it receives an ACK packet from every data node. As a result, data transfer time is increased. Hence, an asynchronous multiple-pipeline file write protocol is proposed in this paper to write data into the Data nodes. A data pipeline is used between the Job Tracker and HDFS to overlap execution with the write operation. It provides load balancing and improves data locality, which maximizes throughput and minimizes delay. Experiments have been conducted on web log files of NASA and academic websites, using Click Count and Sessionization applications. Experimental results show that the data write operation is 1.49 times faster than the conventional method, and that the turnaround time using the dynamic method is improved by 12.4% for Sessionization and 25.6% for Click Count compared to conventional Hadoop MapReduce. Using the proposed method, an average throughput of 23.74 Mbps for Sessionization and 27.7 Mbps for Click Count with the dynamic method is obtained from the experimental results.

I.
INTRODUCTION

MapReduce is a programming model for processing big data in industry and academia [1-3]. Hadoop is an open-source implementation of MapReduce and has become a preferred platform for processing and analyzing big data; it offers fault tolerance, distributed and parallel processing, and load balancing [2,4]. It has two components: the Hadoop Distributed File System (HDFS) and MapReduce. HDFS follows a master-slave architecture, with a single Name node and a number of Data nodes. The Name node is the master node that maintains and manages the blocks present in the Data nodes. It records the metadata of all the files stored in the cluster, such as the locations of stored blocks, file sizes, and permissions. The Data nodes frequently report their status to the Name node through heartbeat messages. HDFS is a file system specially designed to store huge data sets on a cluster of commodity hardware with streaming access patterns. The MapReduce framework performs map and reduce tasks to process the data blocks stored on disk. The issue in MapReduce is that the entire input data must be stored in HDFS before the data blocks can be processed, which results in underutilization of the cluster. In this work, we first identify whether the streaming access pattern works for the given application; the delay introduced during data uploading is then reduced by overlapping the data transmission phase with the data processing phase, while balancing data locality against load balancing to simultaneously maximize throughput and minimize delay. In this proposed work, we consider big data corresponding to the entries in log files of research and academic websites for analysis [5]. Asynchronous multiple pipelines are used to write the data blocks to HDFS, and a data pipeline is used between HDFS and the Job Tracker to overlap the data transmission and data processing phases [6]. Here, a client first sends a request to the Name node to store the input data into HDFS.
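The difference between the conventional synchronous pipeline (where the client idles per block until all ACKs return) and the proposed asynchronous multiple-pipeline write can be illustrated with a minimal timing sketch. The send/ACK durations and pipeline count below are illustrative assumptions, not measurements from this paper, and the model deliberately ignores network contention and replication detail.

```python
import math

# Assumed, illustrative timings (not HDFS measurements):
SEND_TIME = 1.0  # time for the client to push one block into a pipeline
ACK_TIME = 2.0   # time until all replica data nodes acknowledge a block

def synchronous_write(num_blocks):
    # Conventional pipeline: send a block, then idle until its ACK
    # returns from every data node before sending the next block.
    return num_blocks * (SEND_TIME + ACK_TIME)

def asynchronous_write(num_blocks, num_pipelines):
    # Sketch of the proposed scheme: blocks are spread over several
    # concurrent pipelines and the client does not block on ACKs,
    # so only the final block's ACK sits on the critical path.
    send_phase = math.ceil(num_blocks / num_pipelines) * SEND_TIME
    return send_phase + ACK_TIME

print(synchronous_write(8))      # 24.0
print(asynchronous_write(8, 4))  # 4.0
```

Under this toy model the asynchronous write finishes well before the synchronous one for any multi-block file, which is the qualitative behavior the paper's 1.49x speedup reflects.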
The Name node splits the input data and sends the block ID and Data node ID for each input split, and the client uses this information to store the input blocks on the corresponding Data nodes. In this proposed work, the Job Tracker does not wait until all the blocks are loaded into HDFS; instead, after a block is stored in a Data node, it launches a map task to process that block. As a result, the time required to

International Journal of Computer Science and Information Security (IJCSIS), Vol. 17, No. 4, April 2019, p. 97, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
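The overlap just described, launching a map task on each block as soon as it lands rather than waiting for the whole file, can be sketched as a producer/consumer pair. The names `client_writer` and `job_tracker` and the upper-casing "map task" are illustrative stand-ins for this sketch, not Hadoop APIs.

```python
import queue
import threading

stored_blocks = queue.Queue()  # blocks that have "landed in HDFS"
processed = []                 # results of the simulated map tasks

def client_writer(blocks):
    # Client side: write each block and immediately make it visible to
    # the Job Tracker, instead of waiting for the whole input to load.
    for block_id, data in blocks:
        stored_blocks.put((block_id, data))
    stored_blocks.put(None)  # sentinel: no more input splits

def job_tracker():
    # Job Tracker side: launch a "map task" per block as it arrives,
    # overlapping processing with the remaining uploads.
    while True:
        item = stored_blocks.get()
        if item is None:
            break
        block_id, data = item
        processed.append((block_id, data.upper()))  # stand-in map task

blocks = [(i, f"record-{i}") for i in range(4)]
writer = threading.Thread(target=client_writer, args=(blocks,))
tracker = threading.Thread(target=job_tracker)
writer.start(); tracker.start()
writer.join(); tracker.join()
print(processed)
```

Because the consumer starts working on block 0 while blocks 1-3 are still being written, processing time is hidden behind the remaining uploads, which is the source of the turnaround-time improvement the paper reports.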