Computers and Electrical Engineering

MapReduce service provisioning for frequent big data jobs on clouds considering data transfers

Seyed Morteza Nabavinejad a,b, Maziar Goudarzi a,⁎, Saeed Abedi a,c

a Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
b School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
c Department of Computer and Information Science, University of Pennsylvania, USA

ARTICLE INFO

Keywords: Big data, MapReduce, Cloud computing, Hadoop, Energy efficiency

ABSTRACT

Many companies regularly run Big Data analysis and need to optimize their resource usage considering cost, deadline, and environmental impact simultaneously. The cloud allows choosing from various virtual machines (VMs), where the number and type of VMs affect outcomes such as the time for the data placement and data shuffle phases, a task’s energy consumption and execution time, and the makespan of jobs. We provide provisioning and scheduling algorithms that minimize environmental impact, considering the above factors, for frequently executed MapReduce jobs. To model the problem mathematically and obtain the optimal solution, we present an Integer Linear Programming (ILP) model and then continue with two heuristic algorithms. We compare the proposed algorithms against a number of rivals using extensive simulations based on publicly available real-world data. The results demonstrate that our algorithms can achieve near-optimal solutions, sometimes even within 0.39% of the optimal solution obtained by ILP regarding energy consumption.

1. Introduction

The volume of data is steadily increasing. Forecasts such as [1] predict that the volume of digital data in 2020 will be 300 times larger than in 2005, furthering the significance of Big Data as well as Big Data Analytics, which are already important needs.
MapReduce [2] and its open-source implementation Hadoop [3] are among the prevalent frameworks for implementing Big Data Analytics and applications. With the rapid growth in the popularity of Big Data analytics, cloud providers have even launched dedicated services for MapReduce applications, such as the Amazon Elastic MapReduce (EMR) service [4]. Private and hybrid clouds, as well as MapReduce services, have also gained traction due to the advantages the cloud paradigm provides. Big Data jobs of companies usually involve financially and/or strategically invaluable data, and hence transferring this data to public clouds for processing is a constant source of security concerns for them; consequently, in-house Big Data processing over private clouds is becoming more widespread and has gained even more significance and attractiveness. Our work deals with such cases of in-house Big Data processing on a private cloud, where the company’s social responsibilities and obligations require it to reduce its environmental impact in addition to meeting more traditional requirements on cost and deadline.

https://doi.org/10.1016/j.compeleceng.2018.08.005
Received 14 June 2017; Received in revised form 8 August 2018; Accepted 8 August 2018
⁎ Corresponding author at: Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
E-mail addresses: nabavinejad@ipm.ir (S.M. Nabavinejad), goudarzi@sharif.edu (M. Goudarzi), abedi@seas.upenn.edu (S. Abedi).
Computers and Electrical Engineering 71 (2018) 594–610
0045-7906/ © 2018 Elsevier Ltd. All rights reserved.

Using a hybrid/private cloud introduces a number of concerns that arise from the on-demand provisioning nature as well as other features of the cloud: in addition to the pay-per-use cost structure of the cloud, the amount of resources in the cloud is often limited and shared among other tasks of the company; consequently, the number and types of available VMs dynamically