Stochastic Modeling and Optimization of Stragglers

Farshid Farhat, Diman Zad Tootaghaj, Yuxiong He, Anand Sivasubramaniam, Mahmut Kandemir, and Chita R. Das

Abstract—The MapReduce framework is widely used to parallelize batch jobs, since it exploits a high degree of multi-tasking to process them. However, it has been observed that as the number of servers increases, the map phase can take much longer than expected. This paper shows analytically that the stochastic behavior of the servers has a negative effect on the completion time of a MapReduce job, and that continually increasing the number of servers without accurate scheduling can degrade overall performance. We analytically capture the effect of stragglers (delayed mappers or reducers) on performance. Based on the completion time distribution of the tasks, we then model the map phase in terms of hardware, system, and application parameters. The mean sojourn time (MST), the time needed to sync the completed tasks at one reducer, is formulated mathematically. We then minimize MST by finding the optimal task inter-arrival time to each server. We investigate the optimal mapping (as a stochastic scheduling problem) leading to an equilibrium property for different types of inter-arrival and service time distributions in a datacenter with different types of nodes. To our knowledge, there is no prior stochastic scheduler for stragglers; all prior studies take deterministic approaches that can be layered on top of ours. Our experimental results show the performance of different types of schedulers targeting MapReduce applications. We also show that, when deterministic and stochastic schedulers are mixed, there is an optimal scheduler that always achieves the lowest MST.

Index Terms—MapReduce, Modeling and Optimization, Performance Evaluation, Queuing Theory.

1 INTRODUCTION

MAPREDUCE has become a popular paradigm for structuring large-scale parallel computations in datacenters.
By decomposing a given computation into (one or more) Map and Reduce phases, the work within each phase can be accomplished in parallel without worrying about data dependencies, and it is only at the boundaries between these phases where one needs to worry about issues such as data availability and dependency enforcement. At the same time, with the possibility of elastically creating tasks of different sizes within each phase, these computations can adjust themselves to the dynamic capacities available in the datacenter. There has been a lot of prior work in the past decade to leverage this paradigm for different applications [1,2,3], as well as in the systems substrate needed to efficiently support their execution at runtime [4,5,6].

While each phase is highly parallel, the inefficiencies in MapReduce execution manifest at the boundaries between the phases as data exchanges and synchronization stalls, which ensure completion of the prior phases. One of these inefficiencies is commonly referred to as the straggler problem of mappers, where a reduce phase has to wait until all mappers have completed their work [4]. Even if there is one such straggler, the entire computation is consequently slowed down. Prior work [7,8,9,10] has identified several reasons for such stragglers including load imbalance, scheduling inefficiencies, data locality, communication overheads, etc. There have also been efforts looking to address one or more of these concerns to mitigate the straggler problem [7,8,11,12,13].
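The cost of waiting for the slowest mapper can be illustrated with a short simulation (a sketch, not the model developed in this paper): if the task times of n mappers are i.i.d. exponential, the map-phase completion time is their maximum, whose mean grows like the harmonic number H_n ≈ ln n. The function name map_phase_time is ours, introduced only for illustration.

```python
import random
import statistics

def map_phase_time(n_mappers, mean_task_time=1.0, trials=2000, seed=7):
    """Estimate the mean map-phase completion time.

    The reduce phase cannot start until the slowest (straggler)
    mapper finishes, so each trial's phase time is the maximum of
    the n mapper task times, drawn i.i.d. exponential here as a
    simple stand-in for stochastic server behavior.
    """
    rng = random.Random(seed)
    samples = [
        max(rng.expovariate(1.0 / mean_task_time) for _ in range(n_mappers))
        for _ in range(trials)
    ]
    return statistics.mean(samples)

# For exponential task times, E[max of n] = H_n * mean_task_time,
# i.e. roughly 1, 2.93, 5.19, 7.49 for n = 1, 10, 100, 1000.
for n in (1, 10, 100, 1000):
    print(n, round(map_phase_time(n), 2))
```

Even though each server's mean task time is fixed, the expected phase time keeps growing with n, which is exactly the scaling penalty the paper attributes to stragglers when servers are added without careful scheduling.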
While all these prior efforts are important and useful for addressing this problem, we believe that a rigorous set of analytical tools is needed in order to: (i) understand the consequences of stragglers on the performance slowdown of MapReduce execution; (ii) quantify this slowdown as a function of different hardware (processing speed, communication bandwidth, etc.), system (scheduling policy, task-to-node assignment, data distribution, etc.), and application (data size, computation needs, etc.) parameters; (iii) study the impact of different scaling strategies (number of processing nodes, the computation-to-communication and data bandwidths, tasks per node, etc.) on this slowdown; (iv) undertake "what-if" studies for different alternatives (alternate scheduling policies, task assignments to nodes, etc.) beyond what is available for experimentation on the actual platform/system; and (v) use such capabilities for a wide range of optimizations: these could include determining the resources (nodes, their memory capacities, etc.) to provision for MapReduce jobs, the number of tasks to create and even adjust dynamically, the assignment of these tasks to different kinds of nodes (since datacenters can have heterogeneous servers available at a given time), adjusting the scheduling policies, running redundant versions of tasks based on the trade-off between estimated wait times and the additional resources required, executing a MapReduce computation under a budgetary (performance, power, cost) constraint, etc.

————————————————
Farshid Farhat, Diman Zad Tootaghaj, Anand Sivasubramaniam, Mahmut Kandemir, and Chita R. Das are with the School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA. Email: {fuf111,dxz149,anand,kandemir,das}@cse.psu.edu. Yuxiong He is with Microsoft Research, Redmond, WA 98052, USA.
Email: yuxhe@microsoft.com.