Integrating Online Compression to Accelerate Large-Scale Data Analytics Applications

Tekin Bicer*, Jian Yin†, David Chiu‡, Gagan Agrawal* and Karen Schuchardt†
* Computer Science and Engineering, Ohio State University, E-mail: {bicer, agrawal}@cse.ohio-state.edu
† Pacific Northwest National Laboratories, E-mail: {jian.yin, karen.schuchardt}@pnnl.gov
‡ Washington State University, E-mail: david.chiu@wsu.edu

Abstract—Compute cycles in high performance systems are increasing at a much faster pace than both storage and wide-area bandwidths. To continue improving the performance of large-scale data analytics applications, compression has therefore become a promising approach. In this context, this paper makes the following contributions. First, we develop a new compression methodology, which exploits the similarities between spatial and/or temporal neighbors in a popular climate simulation dataset and enables high compression ratios and low decompression costs. Second, we develop a framework that can be used to incorporate a variety of compression and decompression algorithms. This framework also supports a simple API to allow integration with an existing application or data processing middleware. Once a compression algorithm is implemented, this framework automatically mechanizes multi-threaded retrieval, multi-threaded data decompression, and the use of informed prefetching and caching. By integrating this framework with a data-intensive middleware, we have applied our compression methodology and framework to three applications over two datasets, including the Global Cloud-Resolving Model (GCRM) climate dataset. We obtained an average compression ratio of 51.68%, and up to a 53.27% improvement in the execution time of data analysis applications by reducing I/O time through the movement of compressed data.

I. INTRODUCTION

Science has become increasingly data-driven. Data collected from instruments and simulations is extremely valuable for a variety of scientific endeavors.
Both wide-area data dissemination and analysis have become important areas of research over the last few years. These efforts, however, are complicated by the sustained and rapid growth of scientific data sizes. Indeed, the increased computational power afforded by today's high-performance machines allows for simulations with ever higher resolutions over both temporal and spatial scales. As a specific example, the Global Cloud-Resolving Model (GCRM) currently produces 1 petabyte of data for a 4 km grid-cell size over a 10 day simulation. Future plans include simulations with a grid-cell size of 1 km, which will increase the data generated by a factor of 64. Even in the short term (i.e., by 2015), it will be possible to perform 2 km resolution simulations, where a single time step of one three-dimensional variable will require 256 GB of storage.

At the same time, scientific experiments and instruments are also collecting data with increasing granularity. The Advanced LIGO (Laser Interferometer Gravitational-wave Observatories) Project, funded with a $200 million investment from the National Science Foundation, is increasing its sensitivity by a factor of ten, resulting in a three orders of magnitude increase in the number of candidates for gravitational wave signals¹.

Unfortunately, wide-area network bandwidth and disk speeds are growing at a much slower rate. This hampers application scientists' ability to download, manage, and process massive datasets. To reduce data storage, retrieval costs, and transfer overheads, compression techniques have proven to be a popular approach among users [8], [3], [12], [4], [9], [36], [32], [25]. Compression has also recently been applied to reading large scientific files in parallel file systems [40]. However, effectively supporting compression for scientific simulation data and integrating compression with data-intensive applications remains a challenge.
Specifically, we feel that much additional work is needed along the following directions:

• How can the properties of large-scale scientific datasets, especially simulation datasets, be exploited to develop more effective compression algorithms?
• How can we develop software that allows easy plug-and-play of compression and decompression algorithms, while allowing the benefits of prefetching, multi-threading, and caching?
• How can such software be integrated with a data analysis middleware, to help achieve performance benefits for local and remote data analysis?

To address the above challenges, this paper makes the following contributions. First, we develop a new compression methodology, which exploits the fact that spatial and/or temporal neighbors in simulation data have very similar values. Thus, one can simply store certain base values and the deltas pertaining to these base values, which can be represented more efficiently. By storing compact deltas in lieu of full floating point values, we achieve high compression ratios. Additionally, our compression and decompression scheme exploits hardware-supported bitwise operations to keep the costs of coding/decoding very low.

¹ http://media.caltech.edu/press releases/13123
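To illustrate the intuition behind base-plus-delta compression of similar neighbors, the following is a minimal, hypothetical sketch (not the paper's actual algorithm): it stores the first value as a base, then XORs each double's bit pattern with its predecessor's. Because neighboring simulation values share sign, exponent, and high mantissa bits, the XOR deltas have long runs of leading zero bits, which a real encoder could pack compactly. The function names and the (leading-zeros, payload) encoding are ours, chosen only for illustration.

```python
import struct


def xor_delta_compress(values):
    """Store the first value as a base, then XOR each 64-bit float pattern
    with its predecessor's. Similar neighbors yield deltas with many
    leading zero bits; here we record (leading_zeros, delta) pairs as a
    stand-in for a compact bit-packed encoding."""
    bits = [struct.unpack('<Q', struct.pack('<d', v))[0] for v in values]
    base = bits[0]
    encoded = []
    for prev, cur in zip(bits, bits[1:]):
        delta = prev ^ cur
        leading_zeros = 64 - delta.bit_length()  # 64 when delta == 0
        encoded.append((leading_zeros, delta))
    return base, encoded


def xor_delta_decompress(base, encoded):
    """Invert the scheme: rebuild each bit pattern from its predecessor,
    then reinterpret the 64-bit patterns as doubles (lossless round trip)."""
    bits = [base]
    for _, delta in encoded:
        bits.append(bits[-1] ^ delta)
    return [struct.unpack('<d', struct.pack('<Q', b))[0] for b in bits]


# Example: nearby temperature-like values share high-order bits,
# so every delta starts with at least the 12 sign/exponent zero bits.
temps = [288.15, 288.17, 288.16, 288.20]
base, encoded = xor_delta_compress(temps)
assert xor_delta_decompress(base, encoded) == temps
assert all(lz >= 12 for lz, _ in encoded)
```

Since the transform works purely on integer bit patterns, both directions reduce to XORs and shifts that map directly onto hardware bitwise instructions, which is the property the paper's scheme relies on for cheap decoding.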