DOI: 10.4018/IJGHPC.2018100101
International Journal of Grid and High Performance Computing
Volume 10 • Issue 4 • October-December 2018
Copyright © 2018, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

A Predictive Map Task Scheduler for Optimizing Data Locality in MapReduce Clusters

Mohamed Merabet, EEDIS Laboratory, Djillali Liabes University of Sidi Bel Abbes, Sidi Bel Abbes, Algeria
Sidi Mohamed Benslimane, LabRI Laboratory, Ecole Superieure en Informatique, Sidi Bel-Abbes, Algeria
Mahmoud Barhamgi, Claude Bernard Lyon 1 University, Lyon, France
Christine Bonnet, Université Lyon 1, Lyon, France

ABSTRACT

Data locality is becoming one of the most critical factors affecting the performance of MapReduce clusters, as network bisection bandwidth has become a bottleneck. The task scheduler is responsible for assigning the most appropriate map tasks to nodes. When a map task is scheduled to a node that does not hold its input data, the task must issue remote I/O operations to copy the data to the local node, which increases its execution time. In that case, a prefetching mechanism can be used to preload the needed input data before the task is launched. The key challenge is therefore to accurately predict the execution time of map tasks, so that data prefetching can be applied effectively without any data-access delay. This article proposes a Predictive Map Task Scheduler that assigns the most suitable map tasks to nodes ahead of time. It uses a linear regression model for prediction and a data-locality-based algorithm for task scheduling. Experimental results show that the method greatly improves both the data locality and the execution time of map tasks.

KEYWORDS

Data Locality, Execution Time Prediction, Map Task Scheduling, MapReduce, Prefetching

1.
INTRODUCTION

The increasing amount of data generated by commercial and scientific applications, such as social networks, scientific research, and more recently the Internet of Things, has become an important and challenging problem (Min et al., 2014; Marcos et al., 2015). New scalable programming paradigms and efficient scheduling algorithms are necessary to process such big data applications with good performance. Hadoop, an open-source implementation of the MapReduce model, has emerged as one of the most widely used tools, thanks to its ease of programming, speed, scalability, and fault tolerance. Several companies, such as Google, Facebook, Microsoft, IBM, and Amazon, have adopted Hadoop for processing large-scale data volumes in a reasonable time.

Data locality is an important factor impacting the efficiency of task scheduling (Ching-Hsien et al., 2015; Shabeera et al., 2015). In Hadoop, data are distributed and stored locally on nodes, while tasks are deployed to all nodes independently from the data. To execute a map task on a node that does not hold its input data locally, the node must transfer the data from remote providers, which delays the execution of