Data Pipeline in MapReduce

Jiaan Zeng and Beth Plale
School of Informatics and Computing
Indiana University
Bloomington, Indiana 47408
Email: {jiaazeng, plale}@indiana.edu

Abstract—MapReduce is an effective programming model for large scale text and data analysis. Traditional MapReduce implementations, e.g., Hadoop, have the restriction that before any analysis can take place, the entire input dataset must be loaded into the cluster. This can introduce sizable latency when the data set is large, and when it is not possible to load the data once and process it many times, a situation that exists for log files, health records, and protected texts, for instance. We propose a data pipeline approach to hide data upload latency in MapReduce analysis. Our implementation, which is based on Hadoop MapReduce, is completely transparent to the user. It introduces a distributed concurrency queue to coordinate data block allocation and synchronization so as to overlap data upload and execution. The paper addresses the challenge of scheduling map tasks when the input size is unknown, studying both fixed and dynamic numbers of map tasks. We also employ a delay scheduler to achieve data locality in the data pipeline. Evaluation on several applications over real-world data sets shows that our approach yields performance gains.

I. INTRODUCTION

The MapReduce framework [1] has seen considerable uptake in both industry [2] and academia [3], [4]. It provides a simple programming model for large scale data processing with built-in fault tolerance and coordinated parallelization. The original MapReduce framework uses a distributed file system [5] that replicates large blocks across distributed disks. The framework schedules map and reduce tasks to work on data blocks on local disks.
MapReduce has the limitation of being batch oriented: the complete input data set must be loaded into the cluster before any analytical operations can begin, resulting in low cluster utilization while compute instances wait for data to be loaded. For applications that have security sensitivities and must use public compute resources, the data set must be loaded and reloaded for each use. Suppose that, for a given set of applications, it can be determined in advance that a streaming model of processing will work. For this class of applications, we can reduce data upload latency by overlapping processing and data load. Applications that could benefit from a streaming approach to data loading include click stream log analysis [6], [7]. The HathiTrust Research Center [8] supports MapReduce style analysis over massive numbers of digitized books from university libraries. The kinds of analysis carried out on the texts include classification (e.g., topic modeling), statistical summary (e.g., tag clouds), network graph processing, and trend tracking (e.g., sentiment analysis). These applications all tend to heavily use a Solr index to discern the set of texts that match certain selection criteria (e.g., 19th century women authors), then use this work set of texts as the input dataset against which parallel execution takes place. The texts are searched in a linear fashion, page by page, making the algorithms amenable to a streaming approach to data loading.

We propose a data pipeline approach in MapReduce. The data pipeline uses the storage block as the stream's logical unit: when a block is uploaded completely, processing on it can begin. Map tasks are launched before the full upload has finished, giving overlapped execution and data upload. Specifically, we propose adding a distributed concurrency queue to coordinate between the distributed file system and MapReduce jobs.
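The overlap of upload and processing can be sketched as a producer-consumer pattern over block metadata. The sketch below is a minimal single-process illustration, not the paper's implementation: a local thread-safe `queue.Queue` stands in for the distributed concurrency queue, and the names `BlockMeta`, `uploader`, and `map_worker` are hypothetical.

```python
# Sketch: uploads publish completed-block metadata to a queue; map
# workers consume blocks as soon as they are available, so processing
# overlaps with the remaining upload. A local queue.Queue stands in
# for the distributed concurrency queue; all names are illustrative.
import queue
import threading
from dataclasses import dataclass

@dataclass
class BlockMeta:
    block_id: int
    path: str
    length: int

UPLOAD_DONE = object()  # sentinel: upload finished, no more blocks

def uploader(blocks, q):
    """Producer: as each block finishes uploading, publish its metadata."""
    for meta in blocks:
        # ... upload the block's bytes to the file system here ...
        q.put(meta)          # block complete; map tasks may now consume it
    q.put(UPLOAD_DONE)

def map_worker(q, results):
    """Consumer: a map task blocks until a completed block is available."""
    while True:
        item = q.get()
        if item is UPLOAD_DONE:
            q.put(UPLOAD_DONE)   # re-publish sentinel for sibling workers
            return
        results.append(item.block_id)  # stand-in for running map() on block

blocks = [BlockMeta(i, f"/input/part-{i}", 64 << 20) for i in range(4)]
q, results = queue.Queue(), []
workers = [threading.Thread(target=map_worker, args=(q, results))
           for _ in range(2)]
for w in workers:
    w.start()
uploader(blocks, q)          # upload runs while workers process blocks
for w in workers:
    w.join()
print(sorted(results))       # every uploaded block was processed once
```

The re-published sentinel lets any number of consumers drain the queue and terminate; in the distributed setting the same role is played by end-of-upload notification through the shared queue.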
The file system acts as a producer, writing block metadata to the queue, while MapReduce jobs act as consumers and are notified when block metadata becomes available. An additional benefit is that the reducer can receive data earlier, which promotes early result return.

A traditional MapReduce job splits the input data and launches a number of map tasks based on the number of input splits. In our proposed data pipeline MapReduce, however, a job does not know how much input data there will be when it is launched. To determine the number of map tasks to create, we study both fixed and dynamic numbers of map tasks. In addition, we take data locality into account in the context of the data pipeline by using a delay scheduler [9].

In summary, this paper makes the following contributions:

• An architecture that coordinates pipelined data storage with MapReduce processing.
• Two map task launching approaches that handle unknown input data size with locality and fault tolerance in mind.
• A user transparent implementation of the data pipeline that requires no changes to existing code.
• Experimental results that show improved performance.

The remainder of the paper is organized as follows. Section II presents related work, Section III describes the data pipeline architecture in detail, and Section IV discusses experimental results. Finally, Section V concludes with future work.

II. RELATED WORK

Research has looked at pipelining/data streaming to reduce data movement latency. C-MR [10] overlaps data upload and processing by storing input data in a memory buffer and then scheduling map operators to process the data. C-MR is limited to running on a single machine, making it unsuitable for large