Stochastic Modeling and Optimization of
Stragglers
Farshid Farhat, Diman Zad Tootaghaj, Yuxiong He, Anand Sivasubramaniam, Mahmut Kandemir,
and Chita R. Das
Abstract—The MapReduce framework is widely used to parallelize batch jobs, since it exploits a high degree of multi-tasking to process them. However, it has been observed that as the number of servers increases, the map phase can take much longer than expected. This paper shows analytically that the stochastic behavior of the servers has a negative effect on the completion time of a MapReduce job, and that continually increasing the number of servers without accurate scheduling can degrade the overall performance. We analytically capture the effects of stragglers (delayed mappers or reducers) on performance. Based on the completion-time distribution of the tasks, we then model the map phase in terms of hardware, system, and application parameters. We mathematically formulate the mean sojourn time (MST), the time needed to synchronize the completed tasks at one reducer. We then minimize MST by finding the optimal task inter-arrival time to each server. We investigate the optimal mapping (as a stochastic scheduling policy) leading to an equilibrium property for different types of inter-arrival and service time distributions in a datacenter with different types of nodes. To our knowledge, there is no prior stochastic scheduler for stragglers; all earlier studies are deterministic approaches that can be combined with ours. Our experimental results show the performance of different types of schedulers targeting MapReduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that always achieves the lowest MST.
Index Terms— MapReduce, Modeling and Optimization, Performance Evaluation, Queuing Theory.
1 INTRODUCTION
MapReduce has become a popular paradigm for structuring large-scale parallel computations in datacenters. By decomposing a given computation into (one or more) Map and Reduce phases, the work within each phase can be accomplished in parallel without worrying about data dependencies; it is only at the boundaries between these phases that one needs to worry about issues such as data availability and dependency enforcement. At the same time, with the possibility of elastically creating tasks of different sizes within each phase, these computations can adjust themselves to the dynamic capacities available in the datacenter. There has been a lot of prior work in the past decade to leverage this paradigm for different applications [1,2,3], as well as on the systems substrate needed to efficiently support their execution at runtime [4,5,6].
While each phase is highly parallel, the inefficiencies in MapReduce execution manifest at the boundaries between the phases as data exchanges and synchronization stalls, which ensure completion of the prior phases. One of these inefficiencies is commonly referred to as the straggler problem of mappers, where a reduce phase has to wait until all mappers have completed their work [4]. Even if there is only one such straggler, the entire computation is slowed down. Prior work [7,8,9,10] has identified several reasons for such stragglers, including load imbalance, scheduling inefficiencies, data locality, and communication overheads. There have also been efforts to address one or more of these concerns to mitigate the straggler problem [7,8,11,12,13]. While all these prior efforts are important and useful in addressing this problem, we believe that a rigorous set of analytical tools is needed in order to: (i) understand the consequences of stragglers on the performance slowdown of MapReduce execution; (ii) quantify this slowdown as a function of different hardware (processing speed, communication bandwidth, etc.), system (scheduling policy, task-to-node assignment, data distribution, etc.), and application (data size, computation needs, etc.) parameters; (iii) study the impact of different scaling strategies (number of processing nodes, the computation-to-communication and data bandwidths, tasks per node, etc.) on this slowdown; (iv) undertake “what-if” studies for different alternatives (alternate scheduling policies, task assignments to nodes, etc.) beyond what is available to experiment with on the actual platform/system; and (v) use such capabilities for a wide range of optimizations: these could include determining the resources (nodes, their memory capacities, etc.) to provision for MapReduce jobs, the number of tasks to create and even adjust dynamically, the assignment of these tasks to different kinds of nodes (since datacenters may have heterogeneous servers available at a given time), adjusting the scheduling policies, running redundant versions of tasks based on the trade-off between estimated wait times and the additional resources mandated, and executing a MapReduce computation under a budgetary (performance, power, cost) constraint.
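The straggler effect can be seen with a toy calculation that is not the paper's model: if map-task service times are i.i.d. exponential with unit mean, the map phase finishes only when its slowest task does, so its expected completion time is the n-th harmonic number H_n ≈ ln n + 0.577, which grows with the number of parallel tasks even though per-task work is unchanged. A minimal Monte Carlo sketch of this intuition, with all function names and parameters our own illustrative choices:

```python
import random

def map_phase_time(n_tasks, rng):
    # The map phase completes only when its slowest (straggler) task does,
    # so the phase time is the maximum over the parallel task service times.
    # Service times are i.i.d. Exp(1) here -- an illustrative assumption.
    return max(rng.expovariate(1.0) for _ in range(n_tasks))

def mean_completion(n_tasks, trials=5000, seed=0):
    # Monte Carlo estimate of the expected map-phase completion time.
    rng = random.Random(seed)
    return sum(map_phase_time(n_tasks, rng) for _ in range(trials)) / trials
```

For Exp(1) service times the expected maximum of n tasks equals H_n (about 2.93 for n = 10 and about 5.19 for n = 100), so the estimate roughly doubles when the task count grows tenfold; this is the sense in which scaling out without scheduling care lengthens the synchronization wait at the reducer.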
————————————————
Farshid Farhat, Diman Zad Tootaghaj, Anand Sivasubramaniam, Mahmut Kandemir, and Chita R. Das are with the School of Electrical Engineering and Computer Science, The Pennsylvania State University, University Park, PA 16802, USA. E-mail: {fuf111,dxz149,anand,kandemir,das}@cse.psu.edu.
Yuxiong He is with Microsoft Research, Redmond, WA 98052, USA. E-mail: yuxhe@microsoft.com.