Visualization and Adaptive Subsetting of Earth Science Data in HDFS: A Novel Data Analysis Strategy with Hadoop and Spark

Xi Yang, Si Liu, Kun Feng, Shujia Zhou and Xian-He Sun
Department of Computer Science, Illinois Institute of Technology, Chicago, USA
{xyang34, sliu89, kfeng1}@hawk.iit.edu, sun@iit.edu
Northrop Grumman Information Technology, McLean, VA
shujia.zhou@ngc.com

Abstract—Data analytics is becoming increasingly important in big data applications. Adaptively subsetting large volumes of data to extract events of interest, such as the centers of hurricanes or thunderstorms, and then statistically analyzing and visualizing the subset, is an effective way to cope with ever-growing data. This is particularly crucial for analyzing Earth Science data, such as extreme weather. The Hadoop ecosystem (i.e., HDFS, MapReduce, Hive) provides a cost-efficient big data management environment and is being explored for analyzing big Earth Science data. Our study investigates the potential of a MapReduce-like paradigm to perform statistical calculations, and uses the calculated results to subset as well as visualize data in a scalable and efficient way. RHadoop and SparkR are deployed to enable R to access and process data in parallel with Hadoop and Spark, respectively. Standard R libraries and tools are used to create and manipulate images. Statistical calculations, such as maximum and average variable values, are carried out with R or SQL. We have developed a strategy to conduct the query and the visualization within one phase, thereby significantly improving overall performance in a scalable way. The technical challenges and limitations of both the Hadoop and Spark platforms for R are also discussed.

Keywords—Visualization; R; MapReduce; Hadoop; Spark

I. INTRODUCTION

Ever-increasing High-Performance Computing (HPC) capabilities greatly accelerate scientific discovery.
For example, higher-resolution Earth Science (e.g., climate and weather) simulations can be performed over longer periods of time. Consequently, simulation output can easily exceed terabytes, which poses a significant challenge for conventional data analysis tools (e.g., for visualization, diagnosis, and subsetting) that run on a single-node computer [1]. Earth Science researchers typically visualize the whole simulation domain to identify events of interest, such as the centers of hurricanes or thunderstorms. However, those events are dynamic. Hence, an efficient and scalable analysis platform for adaptively subsetting data out of a huge volume of data is highly desirable.

In the past few years, MapReduce [2] has been successful in dealing with big data problems, and the Hadoop MapReduce framework [3] is the most popular big data ecosystem. It features easy programming, transparent parallelism, and fault tolerance on commodity machines. The in-memory computational engine Spark [4] [5] alleviates the expensive disk I/O of storing intermediate results and significantly improves performance, especially for interactive and iterative computations. Spark provides rich APIs, including MapReduce, for efficient programming. Nowadays, in the so-called 'post-Hadoop' era, the Hadoop Distributed File System (HDFS) [6] [7] remains a powerful foundation for cost-efficient big data processing, manages massive data in the Hadoop data lake, and is the most popular storage solution for the Apache Big Data Stack (ABDS) [8]. Furthermore, the MapReduce paradigm has also been extended to HPC [9] [10] and to interactive and real-time problems. It has been widely adopted in scientific research, such as data mining, graph processing, and genetic analysis. Several studies have explored the MapReduce paradigm for image plotting [11] and animation generation.
Moreover, to address big data challenges, a hybrid programming model has been proposed to exploit the merits of multiple programming models [12]. The current implementation of MapReduce is tightly coupled with key-value pair processing in its programming APIs, transparent parallelism support, and optimized I/O system. Consequently, it cannot be directly and efficiently applied to image plotting. Earth Science researchers often use R [13] for data analysis and visualization. However, R is not designed to exploit parallelism and data locality. The R interfaces to the Hadoop ecosystem, RHadoop [14] and SparkR [15], lack efficient strategies for parallelizing R analytic workloads, especially for adaptive subsetting. This paper investigates how R can use a MapReduce-like strategy to analyze data, especially Earth Science data, in a scalable way. It identifies and addresses several challenges in applying MapReduce to data diagnosis and visualization. The contributions include: We demonstrate how to encapsulate R image-plotting functions in the MapReduce paradigm and transparently and properly align tasks to their data.
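The statistic-then-subset strategy described above, computing a per-block statistic in the map phase, aggregating it in the reduce phase, and then using the aggregate to select the interesting blocks, can be sketched outside the Hadoop/R stack as follows. This is a minimal illustration in plain Python with hypothetical in-memory blocks standing in for HDFS chunks; the paper's actual implementation dispatches R functions through RHadoop and SparkR.

```python
from functools import reduce

# Hypothetical gridded data: each block is (block_id, list of values),
# standing in for an HDFS chunk of an Earth Science variable.
blocks = [
    ("block0", [3.1, 7.2, 5.0]),
    ("block1", [9.8, 2.4, 6.6]),
    ("block2", [4.5, 8.9, 1.2]),
]

# Map phase: emit each block's local maximum as a key-value pair.
local_maxima = [(bid, max(vals)) for bid, vals in blocks]

# Reduce phase: combine the local maxima into the global maximum.
global_max = reduce(max, (m for _, m in local_maxima))

# Adaptive subsetting: keep only blocks whose local maximum is close
# to the global maximum (e.g., candidate storm-center regions); only
# these blocks would then be passed to the plotting step.
subset = [bid for bid, m in local_maxima if m >= 0.9 * global_max]

print(global_max)  # 9.8
print(subset)      # ['block1', 'block2']
```

The point of the design is that only the small per-block statistics cross the map/reduce boundary, while the bulk data stays where it is stored; visualization then touches only the selected subset.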
2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) 978-1-5090-3936-4/16 $31.00 © 2016 IEEE DOI 10.1109/BDCloud-SocialCom-SustainCom.2016.24