The Journal of Supercomputing
https://doi.org/10.1007/s11227-020-03162-9
Designing a MapReduce performance model in distributed
heterogeneous platforms based on benchmarking
approach
Abolfazl Gandomi¹ · Ali Movaghar² · Midia Reshadi¹ · Ahmad Khademzadeh³
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
The MapReduce framework is an effective method for parallel processing of big data.
Improving the performance of MapReduce clusters while reducing job execution time
is a fundamental challenge for this approach. Two problems arise: how to maximize
the execution overlap between jobs and how to construct an optimal job schedule.
Developing a precise model to estimate job execution time is one of the most
critical challenges in achieving these goals, given the large number and high
volume of submitted jobs, limited consumable resources, and the need for proper
Hadoop configuration. This paper presents a model based on MapReduce phases for
predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel
heuristic method is designed that significantly reduces the makespan of the jobs.
In this method, a job profiling tool first extracts the execution details of the
MapReduce phases through log analysis. Then, using machine learning methods and
statistical analysis, we build a model to predict runtime. Finally, a second tool,
the job submission and monitoring tool, is used to calculate the makespan.
Experiments were conducted on the benchmarks under identical conditions for all
jobs. The results show that the average makespan speedup of the proposed method
was higher than that of the unoptimized case.
Keywords MapReduce · YARN · Hadoop · Scheduling · Modeling · Makespan
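To make the terms makespan and execution overlap concrete, the following minimal sketch treats each job as a map stage followed by a reduce stage, with each stage serving one job at a time, and shows how job order changes the makespan. This two-stage abstraction, the job durations, and the names `pipeline_makespan`, `best_order`, and `sample_jobs` are illustrative assumptions. They are not the performance model or the heuristic proposed in this paper.

```python
from itertools import permutations

# Hypothetical jobs: (name, map_time, reduce_time). Durations are made up.
sample_jobs = [("A", 4, 1), ("B", 1, 4), ("C", 3, 3)]


def pipeline_makespan(jobs):
    """Return the makespan of jobs run in the given order.

    Assumption: the map stage and the reduce stage each process one job
    at a time, and the map stage of a job may overlap with the reduce
    stage of the job before it.
    """
    map_free = 0.0     # time at which the map stage becomes idle
    reduce_free = 0.0  # time at which the reduce stage becomes idle
    for _name, map_time, reduce_time in jobs:
        map_free += map_time
        # Reduce starts once this job's map is done and the stage is idle.
        reduce_start = max(map_free, reduce_free)
        reduce_free = reduce_start + reduce_time
    return reduce_free


def best_order(jobs):
    """Brute-force the order with the smallest makespan (tiny inputs only)."""
    return min(permutations(jobs), key=pipeline_makespan)


if __name__ == "__main__":
    print("submitted order:", pipeline_makespan(sample_jobs))
    ordered = best_order(sample_jobs)
    print("best order:", [name for name, _, _ in ordered],
          pipeline_makespan(ordered))
```

Under these assumptions, the reordering helps because a job with a short map stage placed first lets the reduce stage start early, so later map work overlaps with it. Exhaustive search is used only to keep the sketch short; it does not scale beyond a handful of jobs.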
1 Introduction
Big data refers to collections of datasets too large to be processed with
traditional methods. Distributed data processing systems are therefore needed
to process such data [1]. MapReduce [2] is a programming model for data processing. MapReduce
* Ali Movaghar
movaghar@sharif.edu
Extended author information available on the last page of the article