PAIRS: A scalable geo-spatial data analytics platform

Levente J Klein*, Fernando J Marianno, Conrad M Albrecht, Marcus Freitag, Siyuan Lu, Nigel Hinds, Xiaoyan Shao, Sergio Bermudez Rodriguez¹, Hendrik F Hamann
IBM TJ Watson Research Center, Yorktown Heights, NY 10598
¹ Osram Sylvania, Beverly, MA 01915
* Email: kleinl@us.ibm.com

Abstract: Geospatial data volume exceeds hundreds of Petabytes and is increasing exponentially, mainly driven by images, videos, and data generated by mobile devices and high resolution imaging systems. Fast data discovery on historical archives and/or real time datasets is currently limited by the variety of data formats, which have different projections and spatial resolutions and require extensive data processing before analytics can be carried out. A new platform called Physical Analytics Integrated Repository and Services (PAIRS) is presented that enables rapid data discovery by automatically updating, joining, and homogenizing data layers in space and time. Built on top of open source big data software, PAIRS manages automatic data download, data curation, and scalable storage while simultaneously serving as a computational platform for running physical and statistical models on the curated datasets. Because data curation is addressed before data is uploaded to the platform, multi-layer queries and filtering can be performed in real time. In addition, PAIRS offers a foundation for developing custom analytics. To that end, we present two examples with models that are running operationally: (1) high resolution evapo-transpiration and vegetation monitoring for agriculture and (2) hyperlocal weather forecasting driven by machine learning for renewable energy forecasting.

Keywords: big data analytics; GIS; Hadoop & HBase for geo-spatial data; MapReduce; data management systems; machine learning

I. INTRODUCTION

Digitization of the “world” is changing many industries, including the way in which geospatial data is analyzed.
With daily imaging of the Earth's surface by multiple satellites, spatial and temporal correlations can be established between locations and events in real time. In addition to satellite images, weather and climate models are updated multiple times per day, generating insight into the atmosphere–earth interaction and its impact on the environment, business activity, and human life. While static or reanalysis studies were carried out in the past, for example to understand deforestation [1], land use [2], or urban area expansion [3], it is expected that in the future these models will run in real time.

The exploding volume of geospatial data requires the development of a scalable platform that can fuse multiple data layers and combine them with local measurements from mobile devices or sensor networks. Such a platform should not only be a data repository but should also serve as a modeling and analytics platform [4]. The combination of data and analytics can be used for running global models of, for example, crop production, water availability, soil moisture, or urban expansion. These models can be updated in real time using the latest available datasets and provide insight into dynamic changes like flooding, wildfires, or landslides as they develop.

The daily generation rate for selected satellite and weather/climate data sets is summarized in Fig. 1. More than 700 Landsat 8 [5] tiles are acquired daily in addition to 400 Landsat 7 tiles; this generates in excess of 1 Terabyte of geospatial data per day. Similarly, the data generation rate of the Moderate Resolution Imaging Spectroradiometer (MODIS) [6] instrument, which acquires data in 36 spectral bands, approaches 1 Terabyte/day. Based on this data, new data products are being derived to analyze the Earth's land, ocean, and atmosphere, generating even more data.
By far the largest geospatial data volume is generated by numerical weather and climate forecasting systems such as the Global Forecast System (GFS), the Global Ensemble Forecast System (GEFS), and the Climate Forecast System (CFS) in the US [7], and the European Centre for Medium-Range Weather Forecasts (ECMWF) model. Weather and climate model output generated by ECMWF exceeds 12 Terabyte/day [8]. If weather and satellite data are to be analyzed and integrated into models before the forecasted data becomes obsolete, then data processing should be accelerated through parallelization. One way to achieve this is to have all data layers curated and homogenized before they are uploaded to the platform, eliminating the time required for data preprocessing at query time. Data curation requires validating and verifying the data and aligning the layers spatially and temporally, such that these layers are ready to be integrated into physical and statistical models without the need for further data download, validation, or preprocessing.
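
To make the idea of spatial alignment concrete, the following is a minimal sketch in Python of how two layers with different resolutions could be brought onto one common grid so that they can be filtered cell by cell. It is an illustration only and not the PAIRS implementation: the `Layer` class, the `resample_nearest` function, the regular latitude/longitude grid, and the choice of nearest-neighbor resampling are all assumptions made for this example, and temporal alignment is left out.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """Hypothetical toy raster: values[i][j] covers a square cell whose
    south-west corner is (lat0 + i * cell, lon0 + j * cell), in degrees."""
    lat0: float
    lon0: float
    cell: float
    values: List[List[Optional[float]]]


def resample_nearest(src: Layer, lat0: float, lon0: float, cell: float,
                     nrows: int, ncols: int) -> Layer:
    """Resample `src` onto a target grid by looking up, for every target
    cell centre, the source cell that contains it (nearest neighbor).
    Target cells outside the source extent are set to None."""
    out: List[List[Optional[float]]] = []
    for i in range(nrows):
        lat = lat0 + (i + 0.5) * cell              # target cell centre
        r = math.floor((lat - src.lat0) / src.cell)
        row: List[Optional[float]] = []
        for j in range(ncols):
            lon = lon0 + (j + 0.5) * cell
            c = math.floor((lon - src.lon0) / src.cell)
            inside = 0 <= r < len(src.values) and 0 <= c < len(src.values[r])
            row.append(src.values[r][c] if inside else None)
        out.append(row)
    return Layer(lat0, lon0, cell, out)


# Two made-up layers covering the same 2 x 2 degree area:
# a coarse one (1 degree cells) and a fine one (0.5 degree cells).
coarse = Layer(0.0, 0.0, 1.0, [[1.0, 2.0], [3.0, 4.0]])
fine = Layer(0.0, 0.0, 0.5,
             [[float(4 * i + j) for j in range(4)] for i in range(4)])

# Align the coarse layer with the fine grid once, ahead of any query.
coarse_on_fine = resample_nearest(coarse, fine.lat0, fine.lon0, fine.cell, 4, 4)

# A multi-layer filter then reduces to a per-cell comparison.
hits = [(i, j) for i in range(4) for j in range(4)
        if coarse_on_fine.values[i][j] is not None
        and coarse_on_fine.values[i][j] > 2.0
        and fine.values[i][j] >= 10.0]
```

Once both layers share a grid, the filter needs no reprojection or resampling at query time, which is the motivation for curating data before upload. A production system would likely choose the resampling method per data type (for example, averaging for continuous fields) rather than nearest neighbor throughout.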