Performance Comparison of Big-Data Technologies in Locating Intersections in Satellite Ground Tracks

Khoa Doan 1,2, Amidu Oloso 2,3, Kwo-Sen Kuo 2,4, Thomas L. Clune 2
1 University of Maryland, Department of Computer Science
2 NASA Goddard Space Flight Center
3 Science Systems and Applications, Inc.
4 Bayesics, LLC
Email: khoadoan@cs.umd.edu, {amidu.o.oloso, kwo-sen.kuo, Thomas.L.Clune}@nasa.gov

Abstract

The performance and ease of extensibility of two Big-Data technologies, SciDB and Hadoop/MapReduce (HD/MR), are evaluated on identical hardware for an Earth science use case: locating intersections between the ground tracks of two NASA remote sensing satellites. SciDB is found to be 1.5 to 2.5 times faster than HD/MR. The performance of HD/MR approaches that of SciDB as the data size or the cluster size increases. Performance of both SciDB and HD/MR is largely insensitive to the chunk size (i.e., granularity). We have found that, at this time, HD/MR is easier to extend than SciDB.

Keywords: multidimensional arrays; MapReduce; intersection algorithm; SciDB.

1. Introduction

Several emerging Big-Data technologies offer cautious hope to scientists facing the daunting challenge of analyzing datasets of unprecedented volume. While none of these technologies is yet mature enough for routine operational use in scientific research, several are sufficiently robust to warrant further investigation into their potential role in a typical research environment. Although most scientists now have access to powerful computational resources, ranging from multi-core laptops to petascale clusters, their personal data analysis workflows seldom exploit the full capabilities of these resources. Performance hence remains largely constrained by serial processing, because exploiting parallelism generally requires additional software engineering skills and resources that typical researchers rarely possess.
Further, because workflows are often unique to each scientific investigation, generic support for parallelism is limited to only a handful of very common analysis patterns. One such common process in a scientific workflow is "subsetting", i.e., the extraction of subsets of research interest from vast volumes of relevant data. Parallel database systems are especially adept at this task. While such systems, e.g. Vertica or Oracle, also facilitate various data analysis tasks, developing analytic capabilities within them is often too arduous for many scientists. More recent frameworks provide simple yet powerful high-level abstractions and tools that make it possible for different types of users to work with data efficiently, without detailed knowledge of the underlying implementation.

Since the publication of MapReduce (MR) [1], data scientists and technologists have adapted and extended it to data analysis applications in many domains. Hadoop (HD) [2], the open-source implementation of MapReduce, has thus become the default choice for almost every Big-Data analysis application, but its sub-optimal performance has been noted in a number of scenarios [3, 4]. More recent developments, such as SciDB [5], which specifically targets multidimensional arrays, provide an attractive alternative to Hadoop/MapReduce (HD/MR) for scientific data analysis. SciDB, a next-generation array-model parallel database system, not only indexes the data it ingests for fast extraction and retrieval, but also provides an attractive, albeit still basic, mathematical/statistical toolbox for data analysis. Like HD/MR, SciDB exploits the affinity between compute and data.

In this paper we compare the two technologies, Hadoop and SciDB, with respect to 1) performance and 2) ease of implementation, using a common use case in Earth science remote sensing. We first describe our use case scenario in Section 2.
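The MapReduce abstraction referenced above can be illustrated with a minimal in-memory sketch. The helper names and the latitude-band counting example below are hypothetical, chosen only to show the map/shuffle/reduce pattern; a real Hadoop job would instead implement Mapper and Reducer classes against the Hadoop Java API.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Illustrative job: count observations per 10-degree latitude band.
def mapper(obs):
    lat, _value = obs
    yield (int(lat // 10) * 10, 1)

def reducer(key, values):
    return sum(values)

observations = [(12.3, 0.5), (15.1, 0.7), (33.9, 0.2), (-4.2, 0.9)]
counts = reduce_phase(shuffle(map_phase(observations, mapper)), reducer)
print(counts)  # {10: 2, 30: 1, -10: 1}
```

The appeal of the pattern for scientists is that only `mapper` and `reducer` are problem-specific; partitioning, grouping, and parallel execution are the framework's responsibility.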
We elaborate in Section 3 a few key considerations regarding the processing of ground track arrays, then describe the array data used in Section 4. The Big-Data algorithms used for our evaluation are introduced in Section 5. In Section 6, we describe our hardware platform, detail our experiments, and report results. We conclude the paper with a discussion and our plans for future work.

2. Use Case Description

The problems we face today concerning our Earth's future are complex and carry grave consequences. We need long-term and comprehensive observations of Earth's conditions to understand this complex system of systems. However, approximately two-thirds of Earth's surface is ocean, where direct and dense measurements are difficult to obtain. Remote sensing is hence the more cost-effective means of obtaining the measurements required to monitor Earth's current health and to provide data for the prediction of its future. Remote sensing problems, however, are usually under-constrained; that is, their problem space is often of higher dimensionality than that covered by the instruments' observations. To gain better constraints and to reduce ambiguity, scientists strive to obtain as much simultaneous, co-located, and independent information as possible concerning the problem space. Our use case is thus to find nearly coincident spaceborne radar measurements of two NASA Earth science

2014 ASE BigData/SocialInformatics/PASSAT/BioMedCom 2014 Conference, Harvard University, December 14-16, 2014 ©ASE 2014 ISBN: 978-1-62561-003-4
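The near-coincidence search at the core of this use case can be sketched as a brute-force scan over two ground tracks, reporting point pairs within a distance threshold. The track data and the 50 km threshold below are illustrative assumptions; this is not the parallel SciDB or HD/MR algorithm the paper evaluates in Section 5.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def near_coincidences(track_a, track_b, max_km=50.0):
    """Brute-force O(n*m) scan for nearly coincident ground-track points."""
    return [(i, j)
            for i, p in enumerate(track_a)
            for j, q in enumerate(track_b)
            if haversine_km(p, q) <= max_km]

# Hypothetical (lat, lon) samples from two satellite ground tracks.
track_a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
track_b = [(0.1, 0.1), (5.0, 5.0)]
print(near_coincidences(track_a, track_b))  # [(0, 0)]
```

The quadratic cost of this naive scan is precisely what motivates the indexed (SciDB) and partitioned (HD/MR) formulations compared later in the paper.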