Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems

Dongfang Zhao⋆†, Jian Yin†, Kan Qiao⋆‡, and Ioan Raicu⋆⋄
⋆Illinois Institute of Technology  †Pacific Northwest National Lab  ‡Google Inc.  ⋄Argonne National Lab
dzhao8@iit.edu, jian.yin@pnnl.gov, kqiao@iit.edu, iraicu@cs.iit.edu

Abstract—Data compression can ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional wisdom of naively applying data compression at the file or block level poses a dilemma between efficient random accesses and high compression ratios. File-level compression can barely support efficient random accesses to the compressed data: any retrieval request triggers decompression from the beginning of the compressed file. Block-level compression provides flexible random accesses to the compressed data, but introduces extra overhead when applying the compressor to every block, which degrades the overall compression ratio. This paper introduces a concept called virtual chunks, aiming to support efficient random accesses to compressed scientific data without sacrificing its compression ratio. In essence, virtual chunks are logical blocks identified by appended references that do not break the physical continuity of the file content. These additional references allow decompression to start from an arbitrary position (efficient random access), while retaining the file's physical entirety to achieve a compression ratio on par with file-level compression. One potential concern with virtual chunks is the space overhead of the additional references, which could degrade the compression ratio, but our analytic study and experimental results demonstrate that this overhead is negligible. We have implemented virtual chunks in two forms: a middleware layer for the GPFS parallel file system, and a module in the FusionFS distributed file system.
Large-scale evaluations on up to 1,024 cores showed that virtual chunks can improve the I/O throughput by up to 2X.

I. INTRODUCTION

As today's scientific applications are becoming data-intensive (e.g., astronomy [1]), one effective approach to relieving the I/O bottleneck of the underlying storage system is data compression. As a case in point, lossless compressors (e.g., LZO [2], bzip2 [3]) can optionally be applied to the input or output files in the Hadoop file system (HDFS) [4], and even lossy compressors [5, 6] can be used in high-level I/O middleware such as HDF5 [7] and NetCDF [8]. By investing some computational time in compression, we hope to reduce the file size, and consequently the I/O time, by enough to offset the computational cost.

State-of-the-art compression mechanisms of parallel and distributed file systems, however, simply apply the compressor to the data either at the file level or the block level¹, and leave the important factors (e.g., computational overhead, compression ratio, I/O pattern) to the underlying compression algorithms.

¹The "chunk", e.g. in HDFS, is really a file from the worker node's perspective, so "chunk-level" is not listed here.

In particular, we observe the following limitations of file-level and block-level compression, respectively:

1) File-level compression suffers from significant overhead for random accesses: decompression must start from the very beginning of the compressed file even when the client requests only a few bytes at an arbitrary position. As a case in point, one of the most common operations in climate research is to retrieve the latest temperature of a particular location. The compressed data set is typically hundreds of gigabytes in size; nevertheless, scientists would need to decompress the entire compressed file just to access the last temperature reading.
This wastes both the scientists' valuable time and scarce computing resources.

2) The deficiency of block-level compression stems from its compression overhead, which is larger than that of the file-level counterpart and results in a degraded compression ratio. To see this, consider a simple scenario in which a 64 MB file is compressed at a 4:1 ratio with 4 KB of overhead (e.g., header, metadata) per compression. The resultant compressed file under file-level compression is about 16 MB + 4 KB = 16.004 MB. If the file is instead split into 64 KB blocks, each compressed independently with the same compressor, the compressed file would be 16 MB + 4 KB × 1K = 20 MB. Block-level compression therefore spends roughly (20 MB − 16.004 MB) / 16.004 MB ≈ 25% more space than file-level compression.

This paper introduces virtual chunks (VC), which aim to better employ existing compression algorithms in parallel and distributed file systems, and ultimately to improve the I/O performance of random data accesses in scientific applications and high-performance computing (HPC) systems. The idea of virtual chunks was first presented in the poster session of the Supercomputing 2014 conference [9]. Virtual chunks do not break the original file into physical chunks or blocks, but append a small number of references to the end of the file. Each of these references points to a specific block that serves as a boundary of a virtual chunk. Because the physical entirety (i.e., the continuity of blocks) of the original file is retained, the compression overhead and compression ratio remain comparable to those of file-level compression. With these additional references, a random file access need not decompress the entire file from the beginning, but could arbitrarily jump onto a reference
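As a rough illustration of this idea (a minimal sketch, not the paper's GPFS/FusionFS implementation), zlib's "full flush" points can emulate virtual-chunk boundaries: the file is compressed as a single stream, but each boundary is byte-aligned and resets the compression window, so decompression can begin at any recorded reference. The data set, chunk size, and `read_from` helper below are illustrative assumptions.

```python
import zlib

# Sketch of virtual chunks via zlib full-flush points: one physical
# compressed stream, plus a list of references to restartable offsets.
DATA = bytes(range(256)) * 4096          # ~1 MB of sample "file" content
VCHUNK = 64 * 1024                       # virtual-chunk size

comp = zlib.compressobj(wbits=-15)       # raw DEFLATE, no zlib header
stream = bytearray()
refs = []                                # compressed offset of each boundary
for off in range(0, len(DATA), VCHUNK):
    refs.append(len(stream))             # reference: where this chunk starts
    stream += comp.compress(DATA[off:off + VCHUNK])
    stream += comp.flush(zlib.Z_FULL_FLUSH)  # byte-align + reset window
stream += comp.flush(zlib.Z_FINISH)
# The paper appends the references to the end of the compressed file;
# this sketch simply keeps them in memory.

def read_from(vchunk_index: int) -> bytes:
    """Random access: decompress starting at an arbitrary virtual chunk."""
    d = zlib.decompressobj(wbits=-15)
    return d.decompress(bytes(stream[refs[vchunk_index]:]))

# Reading virtual chunk 10 decompresses only the tail of the stream,
# not the whole file:
assert read_from(10) == DATA[10 * VCHUNK:]
```

Because the stream is compressed once end-to-end, the per-boundary cost is only the few bytes that a full flush emits, which is why the compression ratio stays close to that of file-level compression.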