Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems

Dongfang Zhao⋆†, Jian Yin†, Kan Qiao⋆‡, and Ioan Raicu⋆⋄
⋆Illinois Institute of Technology  †Pacific Northwest National Lab  ‡Google Inc.  ⋄Argonne National Lab
dzhao8@iit.edu, jian.yin@pnnl.gov, kqiao@iit.edu, iraicu@cs.iit.edu

Abstract—Data compression can ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional wisdom of naively applying data compression at the file or block level poses a dilemma between efficient random accesses and high compression ratios. File-level compression can barely support efficient random accesses to the compressed data: any retrieval request triggers decompression from the beginning of the compressed file. Block-level compression provides flexible random accesses to the compressed data, but introduces extra overhead when applying the compressor to every block, which degrades the overall compression ratio. This paper introduces a concept called virtual chunks, aiming to support efficient random accesses to compressed scientific data without sacrificing its compression ratio. In essence, virtual chunks are logical blocks identified by appended references that do not break the physical continuity of the file content. These additional references allow decompression to start from an arbitrary position (efficient random access), while retaining the file's physical entirety to achieve a compression ratio on par with file-level compression. One potential concern with virtual chunks is the space overhead of the additional references, which could degrade the compression ratio, but our analytic study and experimental results demonstrate that this overhead is negligible. We have implemented virtual chunks in two forms: a middleware layer for the GPFS parallel file system, and a module in the FusionFS distributed file system.
Large-scale evaluations on up to 1,024 cores showed that virtual chunks can improve the I/O throughput by up to 2X.

I. INTRODUCTION

As today's scientific applications are becoming data-intensive (e.g., astronomy [1]), one effective approach to relieving the I/O bottleneck of the underlying storage system is data compression. As a case in point, lossless compressors (e.g., LZO [2], bzip2 [3]) can optionally be applied to the input or output files in the Hadoop file system (HDFS) [4], and even lossy compressors [5, 6] can be used in high-level I/O middleware such as HDF5 [7] and NetCDF [8]. By investing some computational time in compression, we hope to reduce the file size, and consequently the I/O time, by enough to offset the computational cost.

State-of-the-art compression mechanisms of parallel and distributed file systems, however, simply apply the compressor to the data either at the file level or the block level¹, and leave the important factors (e.g., computational overhead, compression ratio, I/O pattern) to the underlying compression algorithms.

¹The "chunk", e.g. in HDFS, is really a file from the worker node's perspective, so "chunk-level" is not listed here.

In particular, we observe the following limitations of file-level and block-level compression, respectively:

1) File-level compression suffers from significant overhead for random accesses: decompression must start from the very beginning of the compressed file even when the client requests only a few bytes at an arbitrary position. As a case in point, one of the most common operations in climate research is to retrieve the latest temperature of a particular location. The compressed data set is typically hundreds of gigabytes in size; nevertheless, scientists would need to decompress the entire compressed file just to access the last temperature reading.
This wastes both the scientists' valuable time and scarce computing resources.

2) The deficiency of block-level compression stems from its compression overhead, which is larger than that of the file-level counterpart and results in a degraded compression ratio. To see this, consider a simple scenario in which a 64 MB file is compressed at a 4:1 ratio with 4 KB of overhead (e.g., header, metadata) per compression. The resultant compressed file under file-level compression is about 16 MB + 4 KB = 16.004 MB. If the file is instead split into 64 KB blocks, each compressed independently with the same compressor, the compressed file would be 16 MB + 4 KB × 1K = 20 MB. Block-level compression therefore spends roughly (20 MB − 16.004 MB) / 16.004 MB ≈ 25% more space than file-level compression.

This paper introduces virtual chunks (VC), which aim to better employ existing compression algorithms in parallel and distributed file systems, and ultimately to improve the I/O performance of random data accesses in scientific applications and high-performance computing (HPC) systems. The idea of virtual chunks was first presented in the poster session of the Supercomputing 2014 conference [9]. Virtual chunks do not break the original file into physical chunks or blocks, but append a small number of references to the end of the file. Each of these references points to a specific block that serves as a boundary of a virtual chunk. Because the physical entirety (i.e., the continuity of blocks) of the original file is retained, the compression overhead and compression ratio remain comparable to those of file-level compression. With these additional references, a random file access need not decompress the entire file from the beginning, but could arbitrarily jump onto a reference
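As a rough illustration of this idea (a minimal sketch, not the paper's GPFS/FusionFS implementation), zlib's "full flush" points can emulate virtual-chunk boundaries: the file is compressed as a single stream, but each boundary is byte-aligned and resets the compression window, so decompression can begin at any recorded reference. The data set, chunk size, and `read_from` helper below are illustrative assumptions.

```python
import zlib

# Sketch of virtual chunks via zlib full-flush points: one physical
# compressed stream, plus a list of references to restartable offsets.
DATA = bytes(range(256)) * 4096          # ~1 MB of sample "file" content
VCHUNK = 64 * 1024                       # virtual-chunk size

comp = zlib.compressobj(wbits=-15)       # raw DEFLATE, no zlib header
stream = bytearray()
refs = []                                # compressed offset of each boundary
for off in range(0, len(DATA), VCHUNK):
    refs.append(len(stream))             # reference: where this chunk starts
    stream += comp.compress(DATA[off:off + VCHUNK])
    stream += comp.flush(zlib.Z_FULL_FLUSH)  # byte-align + reset window
stream += comp.flush(zlib.Z_FINISH)
# The paper appends the references to the end of the compressed file;
# this sketch simply keeps them in memory.

def read_from(vchunk_index: int) -> bytes:
    """Random access: decompress starting at an arbitrary virtual chunk."""
    d = zlib.decompressobj(wbits=-15)
    return d.decompress(bytes(stream[refs[vchunk_index]:]))

# Reading virtual chunk 10 decompresses only the tail of the stream,
# not the whole file:
assert read_from(10) == DATA[10 * VCHUNK:]
```

Because the stream is compressed once end-to-end, the per-boundary cost is only the few bytes that a full flush emits, which is why the compression ratio stays close to that of file-level compression.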