Computers and Electrical Engineering
MapReduce service provisioning for frequent big data jobs on clouds considering data transfers
Seyed Morteza Nabavinejad a,b, Maziar Goudarzi ⁎,a, Saeed Abedi a,c
a Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
b School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
c Department of Computer and Information Science, University of Pennsylvania, USA
ARTICLE INFO
Keywords:
Big data
MapReduce
Cloud computing
Hadoop
Energy efficiency
ABSTRACT
Many companies regularly run Big Data analyses and need to optimize their resource usage while considering cost, deadline, and environmental impact simultaneously. The cloud offers a choice of various virtual machines (VMs), where the number and type of VMs affect outcomes such as the time for the data placement and data shuffle phases, a task’s energy consumption and execution time, and the makespan of jobs. We provide provisioning and scheduling algorithms that minimize environmental impact, considering the above factors, for frequently executed MapReduce jobs. To model the problem mathematically and obtain the optimal solution, we present an Integer Linear Programming (ILP) model, followed by two heuristic algorithms. We compare the proposed algorithms against a number of rivals using extensive simulations based on publicly available real-world data. The results demonstrate that our algorithms can achieve near-optimal solutions, in some cases within 0.39% of the optimal energy consumption obtained by the ILP.
1. Introduction
The volume of data is steadily increasing. Forecasts such as [1] predict that the volume of digital data in 2020 will be 300 times larger than in 2005, which further raises the significance of Big Data and Big Data Analytics, both already important needs. MapReduce [2] and its open-source implementation Hadoop [3] are among the prevalent frameworks for implementing Big Data Analytics and applications. With the rapidly growing popularity of Big Data analytics, cloud providers have even launched dedicated services for MapReduce applications, such as the Amazon Elastic MapReduce (EMR) service [4]. Private and hybrid clouds, as well as MapReduce services, have also gained traction due to the advantages the cloud paradigm provides. The Big Data jobs of a company usually involve data that is financially and/or strategically invaluable to it, and hence transferring this data to public clouds for processing is a constant source of security concerns; consequently, in-house Big Data processing over private clouds is becoming more widespread and attractive. Our work deals with such cases of in-house Big Data processing on a private cloud, where the company’s social responsibilities and obligations require it to reduce its environmental impact in addition to meeting more traditional requirements on cost and deadline.
Using a hybrid/private cloud introduces a number of concerns that arise from the on-demand nature of provisioning as well as other features of the cloud: in addition to the pay-per-use cost structure of the cloud that must be considered, the amount of resources in the cloud is often limited and shared with other tasks of the company; consequently, the number and types of available VMs dynamically
https://doi.org/10.1016/j.compeleceng.2018.08.005
Received 14 June 2017; Received in revised form 8 August 2018; Accepted 8 August 2018
⁎ Corresponding author at: Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
E-mail addresses: nabavinejad@ipm.ir (S.M. Nabavinejad), goudarzi@sharif.edu (M. Goudarzi), abedi@seas.upenn.edu (S. Abedi).
Computers and Electrical Engineering 71 (2018) 594–610
0045-7906/ © 2018 Elsevier Ltd. All rights reserved.