Big Data Processing in Cloud Computing Environments

Changqing Ji∗†, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li

College of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
College of Physical Science and Technology, Dalian University, Dalian 116600, China
School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
Email: {jcqgood, liyu87122, xmdlut2007, awadauche, likeqiu}@gmail.com

Abstract: With the rapid growth of emerging applications such as social network analysis, semantic Web analysis and bioinformatics network analysis, the variety of data to be processed continues to increase quickly. Effective management and analysis of large-scale data poses an interesting but critical challenge. Recently, big data has attracted a lot of attention from academia, industry and government. This paper introduces several big data processing techniques from the system and application perspectives. First, from the viewpoint of cloud data management and big data processing mechanisms, we present the key issues of big data processing, including the cloud computing platform, cloud architecture, cloud database and data storage scheme. Following the MapReduce parallel processing framework, we then introduce MapReduce optimization strategies and applications reported in the literature. Finally, we discuss the open issues and challenges, and explore in depth the future research directions for big data processing in cloud computing environments.

Keywords: Big Data; Cloud Computing; Data Management; Distributed Computing.

I. INTRODUCTION

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data. Big data is not only becoming more available but also more understandable to computers. For example, modern high-energy physics experiments, such as DZero¹, typically generate more than one terabyte of data per day.
The well-known social network website Facebook serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content². Google's search and ad business, Facebook, Flickr, YouTube, and LinkedIn use a bundle of artificial-intelligence techniques that require parsing vast quantities of data and making decisions instantaneously. Multimedia data mining platforms make it easy for everybody to achieve these goals with the minimum amount of effort in terms of software, CPU and network. On March 29, 2012, the U.S. government announced the "Big Data Research and Development Initiative", making big data a matter of national policy for the first time³. All these examples show that big data poses daunting challenges and that significant resources are allocated to support these data-intensive operations, which leads to high storage and data processing costs.

¹ http://www-d0.fnal.gov/
² http://www.facebook.com
³ http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal

Current technologies such as grid and cloud computing all aim to provide access to large amounts of computing power by aggregating resources and offering a single system view. Among these technologies, cloud computing is becoming a powerful architecture for performing large-scale and complex computing, and has revolutionized the way that computing infrastructure is abstracted and used. In addition, an important aim of these technologies is to deliver computing as a solution for tackling big data, such as large-scale, multimedia and high-dimensional data sets.

Big data and cloud computing are both among the fastest-moving technologies identified in Gartner Inc.'s 2012 Hype Cycle for Emerging Technologies⁴. Cloud computing is associated with a new paradigm for the provision of computing infrastructure and a big data processing method for all kinds of resources.
Moreover, some new cloud-based technologies have to be adopted because processing big data concurrently is difficult.

What, then, is big data? In a 2008 publication in the journal Science, "Big Data" is defined as "Represents the progress of the human cognitive processes, usually includes data sets with sizes beyond the ability of current technology, method and theory to capture, manage, and process the data within a tolerable elapsed time" [1]. More recently, Gartner also gave a definition of big data: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" [2]. According to Wikipedia, "In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools"⁵.

⁴ http://www.gartner.com
⁵ http://en.wikipedia.org/wiki/Big-data

2012 International Symposium on Pervasive Systems, Algorithms and Networks. 1087-4089/12 $26.00 © 2012 IEEE. DOI 10.1109/I-SPAN.2012.9

The goal of this paper is to survey the status of big data studies and related work, aiming to provide a general view of big data management technologies and