Ophidia: a full software stack for scientific data
analytics
Sandro Fiore, Alessandro D’Anca, Donatello Elia,
Cosimo Palazzo
Centro Euro-Mediterraneo sui Cambiamenti Climatici
Lecce, Italy
Dean Williams
Lawrence Livermore National Laboratory
Livermore, USA
Ian Foster
University of Chicago & Argonne National Laboratory
Chicago, USA
Giovanni Aloisio
University of Salento & Centro Euro-Mediterraneo sui
Cambiamenti Climatici
Lecce, Italy
Abstract—The Ophidia project aims to provide a big data
analytics platform solution that addresses scientific use cases
related to large volumes of multidimensional data. In this work,
the Ophidia software infrastructure is discussed in detail,
presenting the entire software stack from level-0 (the Ophidia
data store) to level-3 (the Ophidia web service front end). In
particular, this paper presents the big data cube primitives
provided by the Ophidia framework, discussing in detail the most
relevant data cube manipulation operators currently available. These
primitives provide the foundations for building more complex data
cube operators, such as the apex operator presented in this
paper. A massive data reduction experiment on a 1TB climate
dataset is also presented to demonstrate the apex workflow in the
context of the proposed framework.
Keywords—multidimensional data; big data; data analytics;
scientific workflow; software infrastructure.
I. INTRODUCTION
Analyzing multidimensional data is a key task to enable
scientific discovery in several domains [1,2], such as climate
change [3], astrophysics [4] and life sciences [5]. The large
data volumes and production rates in these domains require
scalable, robust and flexible solutions for data processing,
transformation, analysis and visualization [6], [7].
In this context, the Ophidia [8] project aims at providing a
big data analytics solution that runs data-intensive tasks on High
Performance Computing (HPC) machines to address scientific
analysis needs in (near) real time. The platform gives the user
a data cube abstraction to deal with high-dimensional data. It
also provides both a comprehensive set of data cube primitives to
manipulate multidimensional datasets and dataflow support to
build more complex operators by composing the core ones. It
relies on a data-parallel programming model based on a
seven-step algorithm described in [9].
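To make the data cube abstraction concrete without introducing Ophidia's own operator syntax, the sketch below illustrates the underlying idea in plain NumPy: a multidimensional array is treated as a cube, and primitives such as reduction and subsetting collapse or select along its dimensions. The dimension names and sizes are illustrative assumptions, not Ophidia code.

```python
import numpy as np

# Illustrative sketch (not Ophidia code): a 3D "data cube" with
# dimensions (time, lat, lon), as in a gridded climate dataset.
time, lat, lon = 12, 4, 8
cube = np.arange(time * lat * lon, dtype=float).reshape(time, lat, lon)

# A reduction primitive collapses one dimension with an aggregation,
# here averaging over time to obtain a 2D (lat, lon) result.
reduced = cube.mean(axis=0)

# A subsetting primitive selects a region along the spatial dimensions,
# keeping the cube structure intact.
subset = cube[:, 1:3, 2:6]

print(reduced.shape)  # (4, 8)
print(subset.shape)   # (12, 2, 4)
```

More complex operators can then be built by composing such primitives, which is the role the Ophidia framework assigns to its dataflow support.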
The Ophidia platform is being primarily used in the climate
change domain to carry out scientific analysis of the global
climate simulations produced at the Euro-Mediterranean Centre
on Climate Change, in the context of the international Coupled
Model Intercomparison Project Phase 5 (CMIP5) [10,11].
Unlike existing solutions such as ParCAT [12], which focus on
climate data, Ophidia provides a domain-agnostic solution.
Moreover, while ParCAT is a toolkit, Ophidia is a framework.
Other tools, like CDO [13] or NCO [14], are sequential,
command-line based and designed for climate data, whereas
Ophidia is a framework with parallel operators working on
multidimensional data (not only climate). It should also be
noted that, to date, the set of operators available in the
Ophidia framework is not as broad and complete as the one
provided by CDO.
Other related projects are SciDB [15] and Rasdaman [16].
They share with Ophidia the management of n-dimensional
arrays, server-side analysis, and a declarative approach.
Unlike these two systems, however, the Ophidia design is
strongly based on parallel data warehouse concepts and
architectures, with a key focus on the data cube concept, thus
providing strong support in terms of data cube operators, data
cube metadata and a data-cube-centric logical file system. On
the other hand, Ophidia does not support versioning (as SciDB
does), nor does it provide OGC (Open Geospatial Consortium)
[17] compliant interfaces (WCS, WCPS, WCS-T, and WPS) as
Rasdaman does.
With regard to well-known big data frameworks like
Hadoop [18], Ophidia is designed to provide native support
for scientific data management in terms of array data types and
primitives, scientific data formats and numerical libraries,
metadata and provenance support, dataflows, and an
experiment-centric logical file system.
A comprehensive benchmark comparing Ophidia with other
related projects is out of the scope of this paper and will be
presented in a future work.
The main goal of this paper is to present and discuss in
detail the Ophidia software infrastructure, from level-0, the
Ophidia data store, to level-3, the Ophidia web service front end.
This work was supported by the Italian Ministry of Education, Universities
and Research (MIUR) under the GEMINA contract, the EU FP7 EUBrazilCC
Project (Grant Agreement 614048) and by the U.S. Department of Energy,
Office of Science (Contract DE-AC02-06CH11357).
978-1-4799-5313-4/14/$31.00 ©2014 IEEE 343