Investigation of Data Locality and Fairness in MapReduce

Zhenhua Guo, Geoffrey Fox, Mo Zhou
School of Informatics and Computing, Indiana University
Bloomington, IN, USA
{zhguo,gcf,mozhou}@cs.indiana.edu

ABSTRACT
In data-intensive computing, MapReduce is an important tool that allows users to process large amounts of data easily. Its data-locality-aware scheduling strategy exploits the locality of data access to minimize data movement and thus reduce network traffic. In this paper, we first analyze state-of-the-art MapReduce scheduling algorithms and demonstrate that they do not guarantee optimal scheduling. We then mathematically reformulate the scheduling problem, using a cost matrix to capture the cost of data staging, and propose an algorithm lsap-sched that yields optimal data locality. In addition, we integrate fairness and data locality into a unified algorithm lsap-fair-sched in which users can easily adjust the tradeoff between data locality and fairness. Finally, extensive simulation experiments show that our algorithms can improve the ratio of data-local tasks by up to 14%, reduce data movement cost by up to 90%, and balance fairness and data locality effectively.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems – Distributed Applications; D.4.1 [Operating Systems]: Process Management – Scheduling.

General Terms
Algorithms, Management, Measurement, Performance, Design

Keywords
MapReduce, data locality, fairness, scheduling

1. INTRODUCTION
In many science domains, data are being produced and collected continuously at an unprecedented rate by advanced instruments such as the Large Hadron Collider, next-generation genetic sequencers, and astronomical telescopes. Processing such huge amounts of data requires powerful hardware and efficient distributed computing frameworks.
For data-parallel applications, MapReduce [1] has been proposed by Google and adopted in both industry [2] and academia [3,4]. Lin et al. experimented with text processing applications such as inverted indexing and PageRank [3]. Qiu et al. utilized MapReduce to run biology applications such as sequence alignment and multidimensional scaling [4].

One of the most appealing features of MapReduce is data-locality-aware scheduling, which enables the scheduler to consider data affinity and bring computation to data. This differs from traditional grid clusters, where storage and computation are separated, shared file systems are mounted to facilitate data access, and input data are fetched implicitly on demand. MapReduce reduces data movement and cross-rack traffic, which is highly desirable in data-intensive computing. Our primary goal is to maximize the percentage of tasks that achieve data locality and thereby improve overall performance. The default scheduling strategy in Hadoop takes a task-by-task approach and is not optimal. In this paper we propose a new algorithm, lsap-sched, that considers all tasks and available resources at once and yields optimal data locality.

The reduction of job execution time is not always proportional to the improvement of data locality. Consider two jobs A and B that run the same application on different input data of the same size, where 90% of the tasks in A achieve data locality but only 80% of the tasks in B do. Although A has better data locality than B, we cannot conclude that the data transfer time of A is shorter than that of B, because the non-data-local tasks of B may be closer to their data sources and thus able to fetch data much faster than those of A. In environments with network heterogeneity, the bandwidths between different pairs of nodes may be drastically disparate, and data movement costs should not be treated as uniform.

In addition to data locality, fairness is also important in shared clusters.
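The key idea behind an LSAP-based scheduler such as lsap-sched is to score every (task, idle slot) pair with its data-movement cost and then solve the resulting Linear Sum Assignment Problem, rather than placing tasks one by one. The following minimal sketch illustrates this formulation using SciPy's assignment solver; the function name and toy costs are illustrative assumptions, not the paper's implementation:

```python
# A minimal sketch of the LSAP formulation: entry cost[i][j] is the
# data-movement cost of running task i on idle slot j (0 when the slot's
# node already stores the task's input block), and the assignment solver
# picks the task-to-slot mapping with minimum total cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def lsap_schedule(cost):
    """Return (task, slot) pairs minimizing total data-movement cost."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 3 tasks, 3 idle slots; a zero entry marks a data-local pair.
cost = np.array([
    [0, 2, 9],   # task 0's input block is on slot 0's node
    [2, 0, 9],   # task 1's input block is on slot 1's node
    [9, 9, 0],   # task 2's input block is on slot 2's node
])
print(lsap_schedule(cost))  # → [(0, 0), (1, 1), (2, 2)]: all data-local
```

Because all pairs are considered at once, the solver cannot be trapped by a greedy choice that leaves a later task with only expensive remote slots.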
We want to avoid the scenario in which a small number of users overwhelm the whole system and render other users unable to run any useful job. Traditional batch schedulers adopt a reservation-based resource allocation mechanism: for each job, a requested number of nodes is reserved for a specific period of time. Although the whole cluster is shared, the use of individual nodes is usually exclusive among users. MapReduce adopts a more dynamic and aggressive approach that allows tasks owned by different users to run on the same node. Capacity scheduler [5] and fair scheduler [6] are two typical Hadoop schedulers that support multi-tenancy and fair sharing; system administrators manually specify resource shares for job groups, which are then enforced by the scheduler. Fairness and data locality do not always work in harmony, and sometimes they conflict. Strict fairness may degrade data locality, and a purely data-locality-driven scheduling strategy may result in substantial unfairness in resource usage. In our work, we investigate the tradeoff between data locality and fairness and propose an algorithm lsap-fair-sched that allows users to express the tradeoff easily.
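One way such a tradeoff can be folded into the cost-matrix view is to discount the costs of tasks whose owners are below their fair share, with a single weight controlling how strongly fairness overrides locality. The sketch below is purely illustrative (the exact formulation in lsap-fair-sched is not reproduced here); note that the discount is constant across slots for a given task, so it matters when runnable tasks outnumber idle slots: it decides which tasks get scheduled, while data-movement cost decides where.

```python
# Illustrative sketch of a fairness-weighted cost matrix (hypothetical
# helper, not the paper's code): one weight, alpha, trades data locality
# against fair sharing across users.
import numpy as np
from scipy.optimize import linear_sum_assignment

def fair_cost(move_cost, usage, share, alpha):
    """move_cost[i][j]: cost of moving task i's input to slot j.
    usage[i]: current usage of task i's owner; share[i]: its fair share.
    Tasks of under-served users (usage < share) get discounted costs."""
    deficit = np.asarray(share) - np.asarray(usage)   # >0: under-served
    return np.asarray(move_cost, dtype=float) - alpha * deficit[:, None]

# Toy scenario: 3 runnable tasks, only 2 idle slots. Tasks 0 and 1 belong
# to a heavy user (usage 8 vs share 4); task 2 to a light user (usage 1).
move = [[0.0, 2.0], [2.0, 0.0], [3.0, 3.0]]
usage = [8.0, 8.0, 1.0]
share = [4.0, 4.0, 4.0]

rows, _ = linear_sum_assignment(fair_cost(move, usage, share, alpha=0.0))
print(sorted(rows.tolist()))  # → [0, 1]: pure locality ignores task 2

rows, _ = linear_sum_assignment(fair_cost(move, usage, share, alpha=1.0))
print(2 in rows.tolist())     # → True: fairness pulls task 2 in
```

With alpha = 0 the scheduler is purely locality-driven and the light user starves; raising alpha lets the under-served user's task claim a slot even though its input is remote, which is exactly the locality-fairness tension described above.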