DOI: 10.4018/IJGHPC.2018100101
International Journal of Grid and High Performance Computing
Volume 10 • Issue 4 • October-December 2018
Copyright © 2018, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
A Predictive Map Task Scheduler
for Optimizing Data Locality
in MapReduce Clusters
Mohamed Merabet, EEDIS Laboratory, Djillali Liabes University of Sidi Bel Abbes, Sidi Bel Abbes, Algeria
Sidi Mohamed Benslimane, LabRI Laboratory, Ecole Superieure en Informatique, Sidi Bel-Abbes, Algeria
Mahmoud Barhamgi, Claude Bernard Lyon 1 University, Lyon, France
Christine Bonnet, Université Lyon 1, Lyon, France
ABSTRACT
Data locality is becoming one of the most critical factors affecting the performance of MapReduce
clusters, as network bisection bandwidth becomes a bottleneck. The task scheduler assigns the most
appropriate map tasks to nodes. If map tasks are scheduled to nodes that do not hold their input data,
these tasks must issue remote I/O operations to copy the data to the local nodes, which increases the
execution time of map tasks. In that case, a prefetching mechanism can be useful to preload the needed
input data before the tasks are launched. The key challenge, therefore, is how to accurately predict the
execution time of map tasks so that data prefetching can be used effectively, without any data access
delay. This article proposes a Predictive Map Task Scheduler that assigns the most suitable map tasks
to nodes ahead of time. A linear regression model is used for prediction, and a data-locality-based
algorithm for task scheduling. The experimental results show that the method greatly improves both
the data locality and the execution time of map tasks.
KEYWORDS
Data Locality, Execution Time Prediction, Map Task Scheduling, MapReduce, Prefetching
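The prediction step described in the abstract, estimating a map task's execution time with a linear regression model, can be sketched as follows. The feature choice (input split size) and the sample history are illustrative assumptions, not taken from the article, which does not detail its feature set here; this is a minimal single-feature least-squares sketch, not the authors' implementation.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical history: (input split size in MB, observed map task runtime in s)
history = [(64, 10.2), (128, 19.8), (128, 20.5), (256, 40.1), (512, 79.6)]
a, b = fit_linear([s for s, _ in history], [t for _, t in history])

def predict_runtime(split_mb):
    """Predicted execution time of a map task for a given input split size."""
    return a * split_mb + b

# A scheduler could use such an estimate to decide when to start
# prefetching the next task's input so it arrives before the task launches.
print(round(predict_runtime(384), 1))
```

The design point is that the estimate need not be exact: it only has to be accurate enough for the prefetch to finish before the next map task starts, which is what removes the data access delay.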
1. INTRODUCTION
The increasing amount of data generated by commercial and scientific applications such as social
networks, scientific research, and, recently, the Internet of Things, has become an important and
challenging problem (Min et al., 2014; Marcos et al., 2015). New scalable programming paradigms
and complex scheduling algorithms for efficiently processing such big data applications are a necessity
for achieving good performance.
Hadoop, an open source implementation of the MapReduce model, has emerged as one of the
most widely used tools, due to its ease of use, speed, scalability, and fault tolerance. Several
companies such as Google, Facebook, Microsoft, IBM, Amazon and many others have started
using Hadoop for processing large-scale data volumes in moderate time.
Data locality is an important factor impacting the efficiency of task scheduling (Ching-Hsien et
al., 2015; Shabeera et al., 2015). In Hadoop, data are distributed and stored locally on nodes. Tasks,
however, are deployed to nodes independently of where the data reside. To execute a map task on a node
without its local input data, the node needs to transfer data from remote data providers, which delays the execution of