Ophidia: a full software stack for scientific data analytics

Sandro Fiore, Alessandro D'Anca, Donatello Elia, Cosimo Palazzo
Centro Euro-Mediterraneo sui Cambiamenti Climatici, Lecce, Italy

Dean Williams
Lawrence Livermore National Laboratory, Livermore, USA

Ian Foster
University of Chicago & Argonne National Laboratory, Chicago, USA

Giovanni Aloisio
University of Salento & Centro Euro-Mediterraneo sui Cambiamenti Climatici, Lecce, Italy

Abstract—The Ophidia project aims to provide a big data analytics platform that addresses scientific use cases involving large volumes of multidimensional data. In this work, the Ophidia software infrastructure is discussed in detail, presenting the entire software stack from level-0 (the Ophidia data store) to level-3 (the Ophidia web service front end). In particular, this paper presents the big data cube primitives provided by the Ophidia framework, discussing in detail the most relevant available data cube manipulation operators. These primitives provide the foundations on which more complex data cube operators, like the apex one presented in this paper, can be built. A massive data reduction experiment on a 1TB climate dataset is also presented to demonstrate the apex workflow in the context of the proposed framework.

Keywords—multidimensional data; big data; data analytics; scientific workflow; software infrastructure.

I. INTRODUCTION

Analyzing multidimensional data is a key task for enabling scientific discovery in several domains [1,2], such as climate change [3], astrophysics [4] and the life sciences [5]. The large data volumes and production rates in these domains require adequate solutions for scalable, robust and flexible data processing, transformation, analysis and visualization [6,7]. In this context, the Ophidia [8] project aims to provide a big data analytics solution that runs data-intensive tasks on High Performance Computing (HPC) machines to address scientific analysis needs and issues in (near) real time.
Such a big data analytics solution gives the user a data cube abstraction for dealing with high-dimensional data. It provides both a comprehensive set of data cube primitives for manipulating multidimensional datasets and dataflow support for building more complex operators on top of, and by composing, the core ones. It relies on a data parallel programming model based on a seven-step algorithm described in [9]. The Ophidia platform is primarily used in the climate change domain to carry out scientific analysis of the global climate simulations produced at the Euro-Mediterranean Centre on Climate Change, in the context of the international Coupled Model Intercomparison Project Phase 5 (CMIP5) [10,11].

Compared with existing solutions like ParCAT [12], which focuses on climate data, Ophidia provides a domain-agnostic solution; moreover, while ParCAT is a toolkit, Ophidia is a framework. Other solutions, such as CDO [13] and NCO [14], are sequential, command-line based and designed for climate data, whereas Ophidia is a framework whose parallel operators work on generic multidimensional data (not only climate data). It should be noted, however, that to date the set of operators available in the Ophidia framework is not as broad and complete as the one provided by CDO. Other related projects are SciDB [15] and Rasdaman [16]. They share with Ophidia the management of n-dimensional arrays, server-side analysis, and a declarative approach. Unlike these two systems, the Ophidia design is strongly based on parallel data warehouse concepts and architectures, with a key focus on the data cube concept; it thus provides strong support in terms of data cube operators, data cube metadata and a data-cube-centric logical file system. On the other hand, Ophidia supports neither versioning (which SciDB provides instead) nor OGC (Open Geospatial Consortium) [17] compliant interfaces (WCS, WCPS, WCS-T, and WPS), as Rasdaman does.
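The idea of building complex operators by composing core data cube primitives can be illustrated with a minimal conceptual sketch. The code below is purely illustrative and does not use the Ophidia API: the `subset` and `reduce_cube` helpers are hypothetical stand-ins for core primitives, composed here into a composite operation (restrict the spatial domain, then aggregate over time) on a toy time x lat x lon cube.

```python
import numpy as np

# Hypothetical core primitives (illustration only, not the Ophidia API).

def subset(cube, axis, start, stop):
    """Select a slab of the cube along one dimension."""
    index = [slice(None)] * cube.ndim
    index[axis] = slice(start, stop)
    return cube[tuple(index)]

def reduce_cube(cube, axis, op=np.mean):
    """Collapse one dimension of the cube with an aggregation operator."""
    return op(cube, axis=axis)

# A toy 3D data cube: 2 time steps x 3 latitudes x 4 longitudes.
cube = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)

# A composite operator built by composing the core primitives:
# restrict the latitude range, then average over the time dimension.
result = reduce_cube(subset(cube, axis=1, start=0, stop=2), axis=0)
print(result.shape)  # (2, 4)
```

In the actual framework, each such primitive runs as a parallel operator over a fragmented, distributed data cube rather than over an in-memory array.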
With regard to well-known big data frameworks like Hadoop [18], Ophidia is designed to provide native support for scientific data management in terms of array data types and primitives, scientific data formats and numerical libraries, metadata and provenance support, dataflows, and an experiment-centric logical file system. A comprehensive benchmark comparing Ophidia with the other related projects is out of the scope of this paper and will be presented in future work.

The main goal of this paper is to present and discuss in detail the Ophidia software infrastructure, from level-0, related to the Ophidia data store, to level-3, connected with the Ophidia web service front end.

This work was supported by the Italian Ministry of Education, Universities and Research (MIUR) under the GEMINA contract, the EU FP7 EUBrazilCC Project (Grant Agreement 614048) and by the U.S. Department of Energy, Office of Science (Contract DE-AC02-06CH11357).

978-1-4799-5313-4/14/$31.00 ©2014 IEEE