Big Data Processing in Cloud Computing Environments
Changqing Ji∗†, Yu Li‡, Wenming Qiu‡, Uchechukwu Awada‡, Keqiu Li‡
∗College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
†College of Physical Science and Technology, Dalian University, Dalian 116600, China
‡School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Email:{jcqgood, liyu87122, xmdlut2007, awadauche, likeqiu}@gmail.com
Abstract—With the rapid growth of emerging applications
such as social network analysis, semantic Web analysis, and
bioinformatics network analysis, the variety of data to be processed
continues to increase rapidly. Effective management
and analysis of large-scale data poses an interesting but critical
challenge. Recently, big data has attracted a lot of attention
from academia, industry, and government. This paper
introduces several big data processing techniques from the system
and application aspects. First, from the viewpoint of cloud data
management and big data processing mechanisms, we present
the key issues of big data processing, including the cloud computing
platform, cloud architecture, cloud database, and data storage
scheme. Following the MapReduce parallel processing framework,
we then introduce MapReduce optimization strategies
and applications reported in the literature. Finally, we discuss
the open issues and challenges, and explore future research
directions for big data processing in cloud
computing environments.
Keywords-Big Data; Cloud Computing; Data Management;
Distributed Computing.
I. INTRODUCTION
In the last two decades, the continuous increase of
computational power has produced an overwhelming flow
of data. Big data is not only becoming more available
but also more understandable to computers. For example,
modern high-energy physics experiments, such as DZero¹,
typically generate more than one terabyte of data per day.
The famous social network website Facebook serves 570
billion page views per month, stores 3 billion new photos
every month, and manages 25 billion pieces of content².
Google’s search and ad business, Facebook, Flickr, YouTube,
and LinkedIn use a bundle of artificial-intelligence techniques
that require parsing vast quantities of data and making decisions
instantaneously. Multimedia data mining platforms make it
easy for everybody to achieve these goals with the minimum
amount of effort in terms of software, CPU, and network.
On March 29, 2012, the U.S. government announced the
“Big Data Research and Development Initiative”, making big
data a national policy for the first time³. All
these examples show that big data poses daunting challenges and
that significant resources are allocated to support these
data-intensive operations, which leads to high storage and data
processing costs.
¹ http://www-d0.fnal.gov/
² http://www.facebook.com
³ http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal
Current technologies such as grid and cloud computing
all aim to provide access to large amounts of computing
power by aggregating resources and offering a single
system view. Among these technologies, cloud computing
is becoming a powerful architecture for performing large-scale
and complex computing, and has revolutionized the way
that computing infrastructure is abstracted and used. In
addition, an important aim of these technologies is to deliver
computing as a solution for tackling big data, such as
large-scale, multimedia, and high-dimensional data sets.
Big data and cloud computing are both among the fastest-moving
technologies identified in Gartner Inc.’s 2012 Hype Cycle
for Emerging Technologies⁴. Cloud computing is associated
with a new paradigm for the provision of computing
infrastructure and a big data processing method for all kinds of
resources. Moreover, some new cloud-based technologies
have to be adopted because processing big data
concurrently is difficult.
What, then, is big data? In a 2008 publication of the journal
Science, “Big Data” is defined as “Represents the
progress of the human cognitive processes, usually includes
data sets with sizes beyond the ability of current technology,
method and theory to capture, manage, and process the data
within a tolerable elapsed time” [1]. More recently, a definition
of big data was also given by Gartner: “Big Data are
high-volume, high-velocity, and/or high-variety information
assets that require new forms of processing to enable
enhanced decision making, insight discovery and process
optimization” [2]. According to Wikipedia, “In information
technology, big data is a collection of data sets so large and
complex that it becomes difficult to process using on-hand
database management tools”⁵.
The goal of this paper is to survey the status of big
data studies and related work, with the aim of providing
a general view of big data management technologies and
⁴ http://www.gartner.com
⁵ http://en.wikipedia.org/wiki/Big-data
2012 International Symposium on Pervasive Systems, Algorithms and Networks
1087-4089/12 $26.00 © 2012 IEEE
DOI 10.1109/I-SPAN.2012.9