Introduction
The amount of data in our world has been exploding, and
analysing these large data sets, known as big data, is becoming
a key basis of much research. Data is being collected and stored
at unprecedented rates. The challenge is not only to store and
manage the vast volume of data (“big data”), but also to analyse
and extract meaningful value from it. There are several
approaches to collecting, storing, processing, and analysing big
data; MapReduce is one of the existing mechanisms for big data
processing.
MapReduce is a distributed programming framework
designed to ease the development of scalable data-intensive
applications for large clusters of commodity machines. The
MapReduce model, introduced by Google, provides an easy-to-use
programming interface that features fault tolerance, automatic
parallelization, scalability and data-locality-based
optimizations. Due to their
excellent fault tolerance features, MapReduce frameworks are
well-suited for the execution of large distributed jobs in brittle
environments such as commodity clusters and cloud
infrastructures [5][12].
Hadoop MapReduce provides a mechanism for
programmers to leverage distributed systems for processing
data sets. MapReduce can be divided into two distinct phases:
• Map phase: divides the workload into smaller sub-workloads
and assigns tasks to Mappers, each of which processes one unit
block of data. The output of a Mapper is a sorted list of
(key, value) pairs. This list is passed (in a step called shuffling)
to the next phase.
• Reduce phase: analyses and merges its input to produce the
final output, which is written to HDFS in the cluster.
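The two phases above can be sketched with a toy word-count job. This is an illustrative Python sketch of the MapReduce model, not Hadoop's actual Java API; the function names (map_phase, shuffle, reduce_phase, run_job) are our own assumptions.

```python
from collections import defaultdict

def map_phase(record):
    """Map: turn one unit block of data into a list of (key, value) pairs."""
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: sort the (key, value) pairs and group them by key
    so each reducer receives all values for one key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: merge all values for one key into the final output."""
    return (key, sum(values))

def run_job(records):
    """Run the whole job: map every record, shuffle, then reduce each group."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())
```

In a real Hadoop cluster the records are HDFS blocks, the Mappers and Reducers run on different machines, and the framework performs the shuffle over the network; the control flow, however, follows this sketch.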
Cloud computing is a new paradigm for the provision of
computing infrastructure. This paradigm shifts the location of
this infrastructure to the network to reduce the costs associated
with the management of hardware and software resources.
Hence, businesses and users can access application services
from anywhere in the world [11].
Characteristics of cloud services such as on-demand self-
service, broad network access, resource pooling, rapid
elasticity and measured service allow MapReduce to take
advantage of cloud infrastructure services, making the cloud a
good platform for implementing MapReduce [3][11].
In this paper we present a complete comparison of two
different implementations of the MapReduce programming model
built on top of cloud computing. The rest of the paper is
organized as follows. Cloud computing and the cloud service
models are briefly explained, followed by MapReduce, its
architecture, and the characteristics of MapReduce
implementations in the cloud environment. We then discuss and
compare two models of cloud MapReduce, and present
concluding remarks.
Cloud Computing
The concept of cloud computing addresses the next
evolutionary step in distributed computing. The goal of this
computing model is to make better use of distributed
resources, putting them together in order to achieve higher
throughput and to tackle large-scale computation
problems. Cloud computing is not a completely new concept for
the development and operation of web applications. It allows for
the most cost-effective development of scalable web portals on
highly available and fail-safe infrastructure [1].
Cloud computing deals with virtualization, scalability,
interoperability, quality of service and the deployment models
of the cloud, namely private, public and hybrid.
A more structured definition is given by Buyya et al. [2],
who define a Cloud as a “type of parallel and distributed
system consisting of a collection of interconnected and
E-mail addresses: b_rashidi@comp.iust.ac.ir
© 2012 Elixir All rights reserved
A Comparison of Amazon Elastic Mapreduce and Azure Mapreduce
Bahman Rashidi¹, Esmail Asyabi¹ and Talie Jafari²
¹ Iran University of Science and Technology (IUST).
² Amirkabir University of Technology.
ABSTRACT
In the last two decades, the continuous increase of computational power and recent
advances in web technology have produced large amounts of data, which require
large-scale data processing mechanisms to handle. MapReduce is a programming model
for large-scale distributed data processing in an efficient and transparent way, notable
for its excellent fault tolerance, scalability and ease of use. Currently, there are several
options for using MapReduce in cloud environments, such as using MapReduce as a service,
setting up one’s own MapReduce cluster on cloud instances, or using specialized cloud
MapReduce runtimes that take advantage of cloud infrastructure services. Cloud computing
has recently emerged as a new paradigm that provides computing infrastructure and
large-scale data processing mechanisms in the network. Because the cloud is on-demand,
scalable and highly available, implementing MapReduce on top of cloud services yields a
faster, more scalable and more highly available MapReduce framework for large-scale data
processing. In this paper we explain how to implement MapReduce in the cloud and
compare implementations of MapReduce on the Azure cloud, the Amazon cloud and Hadoop.
ARTICLE INFO
Article history:
Received: 18 October 2012;
Received in revised form:
7 December 2012;
Accepted: 14 December 2012;
Keywords
Cloud computing, MapReduce, Cloud MapReduce, Azure MapReduce, Amazon Elastic MapReduce.
Elixir Comp. Sci. & Engg. 53 (2012) 12059-12064
Computer Science and Engineering
Available online at www.elixirpublishers.com (Elixir International Journal)