The Journal of Supercomputing
https://doi.org/10.1007/s11227-020-03162-9

Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Abolfazl Gandomi 1 · Ali Movaghar 2 · Midia Reshadi 1 · Ahmad Khademzadeh 3

* Ali Movaghar, movaghar@sharif.edu
Extended author information available on the last page of the article

© Springer Science+Business Media, LLC, part of Springer Nature 2020

Abstract
The MapReduce framework is an effective method for parallel processing of big data. Enhancing the performance of MapReduce clusters, along with reducing their job execution time, is a fundamental challenge to this approach. In fact, one is faced with two challenges here: how to maximize the execution overlap between jobs and how to create an optimum job schedule. Accordingly, one of the most critical challenges to achieving these goals is developing a precise model to estimate job execution time, given the large number and high volume of submitted jobs, the limited consumable resources, and the need for proper Hadoop configuration. This paper presents a model based on MapReduce phases for predicting the execution time of jobs in a heterogeneous cluster. Moreover, a novel heuristic method is designed, which significantly reduces the makespan of the jobs. In this method, we first provide a job profiling tool that obtains the execution details of the MapReduce phases through log analysis. Then, using machine learning methods and statistical analysis, we propose a relevant model to predict runtime. Finally, another tool, called the job submission and monitoring tool, is used for calculating makespan. Different experiments were conducted on the benchmarks under identical conditions for all jobs. The results show that the average makespan speedup for the proposed method was higher than in the unoptimized case.

Keywords
MapReduce · YARN · Hadoop · Scheduling · Modeling · Makespan

1 Introduction
Big data refers to large datasets that cannot be processed with traditional methods. Distributed data processing systems are therefore needed to process such data [1]. MapReduce [2] is a programming model for data processing. MapReduce