Visualization and Adaptive Subsetting of Earth
Science Data in HDFS
A Novel Data Analysis Strategy with Hadoop and Spark
Xi Yang∗, Si Liu∗, Kun Feng∗, Shujia Zhou† and Xian-He Sun∗
∗Department of Computer Science, Illinois Institute of Technology, Chicago, USA
{xyang34, sliu89, kfeng1}@hawk.iit.edu, sun@iit.edu
†Northrop Grumman Information Technology, McLean, VA
shujia.zhou@ngc.com
Abstract—Data analytics is becoming increasingly important in big data applications. Adaptively subsetting large amounts of data to extract events of interest, such as the centers of hurricanes or thunderstorms, and then statistically analyzing and visualizing the subset, is an effective way to analyze ever-growing data. This is particularly crucial for analyzing Earth Science data, such as extreme weather. The Hadoop ecosystem (i.e., HDFS, MapReduce, Hive) provides a cost-efficient big data management environment and is being explored for analyzing big Earth Science data.
Our study investigates the potential of a MapReduce-like paradigm to perform statistical calculations, and utilizes the calculated results to subset as well as visualize data in a scalable and efficient way. RHadoop and SparkR are deployed to enable R to access and process data in parallel with Hadoop and Spark, respectively. The regular R libraries and tools are utilized to create and manipulate images. Statistical calculations, such as maximum and average variable values, are carried out with R or SQL. We have developed a strategy to conduct querying and visualization within a single phase, significantly improving the overall performance in a scalable way. The technical challenges and limitations of both the Hadoop and Spark platforms for R are also discussed.
Keywords—Visualization; R; MapReduce; Hadoop; Spark
I. INTRODUCTION
Ever-increasing High-Performance Computing (HPC) capabilities greatly accelerate scientific discovery. For example, higher-resolution Earth Science (e.g., climate and weather) simulations can be performed over longer periods of time. Consequently, simulation output can easily exceed terabytes, which poses a significant challenge for conventional data analysis tools (e.g., for visualization, diagnosis, and subsetting) that run on a single compute node [1].
Earth Science researchers typically use visualization of the whole simulation domain to identify events of interest, such as the centers of hurricanes or thunderstorms. However, these events are dynamic. Hence, an efficient and scalable analysis platform for adaptively subsetting data out of a huge volume of data is highly desirable.
In the past few years, MapReduce [2] has been successful in dealing with big data problems, and the Hadoop MapReduce framework [3] is the most popular big data ecosystem. It features easy programming, transparent parallelism, and fault tolerance on commodity machines. The in-memory computation engine Spark [4] [5] alleviates the expensive disk I/O of storing intermediate results and significantly improves performance, especially for interactive and iterative computations. Spark provides rich APIs, including MapReduce, for efficient programming. Nowadays, in the so-called 'post-Hadoop' era, the Hadoop Distributed File System (HDFS) [6] [7] remains a powerful, cost-efficient foundation for big data processing, managing massive data in the Hadoop data lake, and is the most popular storage solution in the Apache Big Data Stack (ABDS) [8]. Furthermore, the MapReduce paradigm has also been extended to HPC [9] [10] and to interactive and real-time problems. It has been widely adopted in scientific research, such as data mining, graph processing, and genetic analysis.
Several studies have explored the MapReduce paradigm for image plotting [11] and animation generation. Moreover, to address the big data challenges, a hybrid programming model was proposed to exploit the merits of multiple programming models [12].
The current implementation of MapReduce is tightly coupled with key-value pair processing in terms of programming APIs, transparent parallelism support, and the optimized I/O system. Consequently, it cannot be directly and efficiently applied to image plotting. Earth Science researchers often use R [13] for data analysis and visualization. However, R is not designed to exploit parallelism and data locality. The extended R interfaces of the Hadoop ecosystem, RHadoop [14] and SparkR [15], lack efficient strategies to parallelize R analytic workloads, especially for adaptive subsetting.
This paper investigates how R can utilize a MapReduce-like
strategy to analyze data in a scalable way, especially for Earth
Science data. It identifies and addresses several challenges in
utilizing MapReduce for data diagnosis and visualization. The
contributions include:
• We demonstrate how to encapsulate an R image plotting function into the MapReduce paradigm and transparently and adequately align tasks to data.
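The contribution above rests on a general pattern: each map task processes one data partition locally (computing statistics and, in the paper's setting, invoking an R plotting function on that subset), and a reduce step merges the per-partition results into global values. The following Python sketch illustrates only this pattern; the function names and the dictionary layout are hypothetical, not the paper's RHadoop/SparkR implementation.

```python
# Illustrative sketch of the map/reduce pattern described above.
# map_partition and reduce_results are hypothetical names; in the paper's
# setting, the map task would also call an R plotting routine on its subset.

def map_partition(partition):
    """Map task: compute local statistics for one data partition."""
    return {
        "max": max(partition),      # local maximum of the variable
        "sum": sum(partition),      # local sum, for a global mean later
        "count": len(partition),    # local element count
    }

def reduce_results(results):
    """Reduce task: merge per-partition statistics into global values."""
    total = sum(r["sum"] for r in results)
    count = sum(r["count"] for r in results)
    return {
        "max": max(r["max"] for r in results),
        "mean": total / count,
    }

# Example: three partitions of a simulated variable field.
partitions = [[1.0, 5.0, 3.0], [9.0, 2.0], [4.0, 4.0, 4.0]]
stats = reduce_results([map_partition(p) for p in partitions])
```

Because each map task touches only its own partition, the tasks can run in parallel and be scheduled near the data, which is the locality property the paper exploits.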
2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking
(SocialCom), Sustainable Computing and Communications (SustainCom)
978-1-5090-3936-4/16 $31.00 © 2016 IEEE
DOI 10.1109/BDCloud-SocialCom-SustainCom.2016.24
88