Performance Comparison of Big-Data Technologies in Locating
Intersections in Satellite Ground Tracks
Khoa Doan (1,2), Amidu Oloso (2,3), Kwo-Sen Kuo (2,4), Thomas L. Clune (2)
(1) University of Maryland, Department of Computer Science
(2) NASA Goddard Space Flight Center
(3) Science Systems and Applications, Inc.
(4) Bayesics, LLC
Email: khoadoan@cs.umd.edu, {amidu.o.oloso, kwo-sen.kuo, Thomas.L.Clune}@nasa.gov
Abstract
The performance and ease of extensibility of two Big-Data technologies, SciDB and Hadoop/MapReduce (HD/MR), are evaluated on identical hardware for an Earth science use case: locating intersections between the ground tracks of two NASA remote-sensing satellites. SciDB is found to be 1.5 to 2.5 times faster than HD/MR. The performance of HD/MR approaches that of SciDB as the data size or the cluster size increases. Performance of both SciDB and HD/MR is largely insensitive to the chunk size (i.e., granularity). We have found that it is easier to extend HD/MR than SciDB at this time.
Keywords: Multidimensional arrays; MapReduce; intersection algorithm; SciDB.
1. Introduction
Several emerging Big-Data technologies offer cautious hope to
scientists facing the daunting challenge of analyzing datasets of
unprecedented volumes in the era of Big Data. While none of
these technologies is yet mature enough for routine operational
use in scientific research, several are sufficiently robust to
warrant further investigation into their potential role in a
typical research environment.
Although most scientists now have access to powerful
computational resources ranging from multi-core laptops to
petascale clusters, their personal data analysis workflows
seldom exploit the full capabilities of these resources.
Performance hence remains largely constrained by serial
processing, because exploiting parallelism generally requires
additional software engineering skills and resources that typical
researchers rarely possess. Further, because the workflows are
often unique to each scientific investigation, generic support
for parallelism is limited to only a handful of very common
analysis patterns.
One of the common processes in a scientific workflow is
“subsetting”, i.e. the extraction of subsets of research interest
from vast volumes of relevant datasets. Parallel database
systems are especially adept at this process. While such systems, e.g., Vertica or Oracle, also facilitate various data analysis tasks, developing analytic capabilities in them is often too arduous for many scientists. More recent frameworks provide simple yet powerful high-level abstractions and tools that make it possible for different types of users to work with data efficiently without detailed knowledge of the underlying implementation.
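As a concrete illustration, subsetting amounts to extracting a small region of research interest from a much larger multidimensional dataset. A minimal single-machine sketch using NumPy (the array and index ranges are hypothetical; the datasets in question are far larger and distributed):

```python
import numpy as np

# Hypothetical dataset indexed as (time, latitude, longitude);
# real remote-sensing archives are far larger and distributed.
field = np.arange(24 * 180 * 360, dtype=np.float64).reshape(24, 180, 360)

# Extract a spatiotemporal subset of interest: hours 6-12,
# latitude indices 120-150, longitude indices 180-210.
subset = field[6:12, 120:150, 180:210]

print(subset.shape)  # (6, 30, 30)
```

A parallel database system performs the analogous extraction across many nodes, using its indexes to touch only the stored partitions that overlap the requested region.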
Since the publication of MapReduce (MR) [1], data
scientists and technologists have tried to adapt and extend it to
many data analysis applications in various domains. Hadoop (HD) [2], an open-source implementation of MapReduce, has thus
become the default choice for almost every Big-Data analysis
application, but its sub-optimal performance has been noted in
a number of scenarios [3, 4].
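The MapReduce model referenced here can be summarized in a few lines: a map function emits key-value pairs, the framework groups the pairs by key (the "shuffle"), and a reduce function aggregates each group. A toy single-process sketch of the model, with our own function names rather than Hadoop's actual API:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map phase: each record may emit any number of (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: aggregate the values collected under each key.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word-count example.
lines = ["big data", "big arrays"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
print(counts)  # {'big': 2, 'data': 1, 'arrays': 1}
```

In Hadoop the map and reduce phases run in parallel across a cluster, with the shuffle performed over the network between them.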
Recent technological developments, such as SciDB [5],
which specifically target multidimensional arrays, are
providing an attractive alternative to Hadoop/MapReduce
(HD/MR) for scientific data analysis. SciDB, a next-generation
array-model parallel database system, not only indexes the data
it ingests for fast extraction and retrieval, but also provides an
attractive, albeit still basic, mathematical/statistical toolbox for
data analysis. Like HD/MR, SciDB exploits compute-data affinity, i.e., it moves computation to where the data reside.
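The idea underlying both systems, partitioning a large array into regular chunks so that each node processes the chunks it stores, can be sketched as follows. This is our own simplified, single-machine illustration, not SciDB's actual API:

```python
import numpy as np

def split_into_chunks(array, chunk_rows):
    # Partition a 2-D array into row-wise chunks; in a distributed
    # system each chunk would live on, and be processed by, the
    # node that stores it (compute-data affinity).
    return {i: array[i:i + chunk_rows]
            for i in range(0, array.shape[0], chunk_rows)}

track = np.random.rand(1000, 3)         # e.g., (lat, lon, time) samples
chunks = split_into_chunks(track, 250)  # 4 chunks of 250 rows each

# A per-chunk operation runs independently on every chunk; partial
# results (here, per-chunk sums) are combined into the final answer.
partial = {i: c.sum(axis=0) for i, c in chunks.items()}
overall_mean = sum(partial.values()) / track.shape[0]
print(np.allclose(overall_mean, track.mean(axis=0)))
```

The chunk size (granularity) controls how finely the work is divided, which is why its effect on performance is examined in our experiments.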
In this paper we compare the two technologies, Hadoop and SciDB, with respect to 1) performance and 2) ease of implementation, using a common use case in Earth science remote sensing. We first describe our use case scenario in Section 2. In Section 3 we elaborate on a few key considerations regarding the processing of ground-track arrays, and in Section 4 we describe the array data used. The Big-Data algorithms used for our evaluation are introduced in Section 5. In Section 6, we describe our hardware platform, detail our experiments, and report results. We conclude the paper with a discussion and our plans for future work.
2. Use Case Description
The problems our Earth faces today are complex and carry grave consequences for its future. We need long-term and comprehensive observations of Earth’s conditions to understand this complex system of systems. However, approximately two-thirds of Earth’s surface is ocean, where direct and dense measurements are difficult to obtain. Remote sensing is hence the more cost-effective means of obtaining the measurements required to monitor Earth’s current health and to provide data for the prediction of its future.
Remote sensing problems, however, are usually under-constrained. That is, the problem space is often of a higher dimensionality than that covered by the instruments’ observations. To gain better constraints and to reduce ambiguity, scientists strive to obtain as much simultaneous, co-located, and independent information as possible concerning the problem space. Our use case is thus to find nearly coincident spaceborne radar measurements of two NASA Earth science
2014 ASE BigData/SocialInformatics/PASSAT/BioMedCom 2014 Conference, Harvard University, December 14-16, 2014
©ASE 2014 ISBN: 978-1-62561-003-4 1