Small-File Access in Parallel File Systems

Philip Carns, Sam Lang, Robert Ross
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439
{carns,slang,rross}@mcs.anl.gov

Murali Vilayannur
VMware Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
muraliv@vmware.com

Julian Kunkel, Thomas Ludwig
Institute of Computer Science
University of Heidelberg
{Julian.Kunkel,Thomas.Ludwig}@Informatik.uni-heidelberg.de

Abstract—Today's computational science demands have resulted in ever larger parallel computers, and storage systems have grown to match these demands. Parallel file systems used in this environment are increasingly specialized to extract the highest possible performance for large I/O operations, at the expense of other potential workloads. While some applications have adapted to I/O best practices and can obtain good performance on these systems, the natural I/O patterns of many applications result in generation of many small files. These applications are not well served by current parallel file systems at very large scale.

This paper describes five techniques for optimizing small-file access in parallel file systems for very large scale systems. These five techniques are all implemented in a single parallel file system (PVFS) and then systematically assessed on two test platforms. A microbenchmark and the mdtest benchmark are used to evaluate the optimizations at an unprecedented scale. We observe as much as a 905% improvement in small-file create rates, 1,106% improvement in small-file stat rates, and 727% improvement in small-file removal rates, compared to a baseline PVFS configuration on a leadership computing platform using 16,384 cores.

I. INTRODUCTION

Today's computational science demands have resulted in ever larger parallel computers, and storage systems for these computers have likewise grown to match the rates at which applications generate data.
Parallel file systems used in this environment have become increasingly specialized in an attempt to extract the best possible performance from underlying storage hardware for computational science application workloads. These specialized systems excel at large and aligned concurrent access, and some applications have recognized that performing large accesses to multi-gigabyte files is the best way to leverage parallel file systems. Other applications continue to use other I/O strategies, with varying degrees of success. Meanwhile, scientists in new domains are beginning to use high-performance computing (HPC) resources to attack problems in their areas of expertise, and these applications bring new I/O demands.

The results can be seen in recent workload studies. In practice, many HPC storage systems are used to store many small files in addition to the large ones. For example, a 2007 study of a shared parallel file system at the National Energy Research Scientific Computing Center showed that it contained over 13 million files, 99% of which were under 64 MBytes and 43% of which were under 64 KBytes [1]. A similar 2007 study at the Pacific Northwest National Laboratory showed that of the 12 million files on that system, 94% of files were under 64 MBytes and 58% were under 64 KBytes [2].

Further investigation finds that these files come from a number of sources, not just one misbehaving application. Several scientific domains such as climatology, astronomy, and biology generate data sets that are most conveniently stored and organized on a file system as independent files.
The following are examples of data sets from each field (respectively):

- 450,000 Community Climate System Model files with an average size of 61 MBytes [3]
- 20 million images hosted by the Sloan Digital Sky Survey with an average size of less than 1 MByte [4]
- up to 30 million files averaging 190 KBytes generated by sequencing the human genome [5]

Accessing a large number of small files on a parallel file system shifts the I/O challenge from providing high aggregate I/O throughput to supporting highly concurrent metadata access rates. The most common technique currently used to improve metadata rates in file systems is client-side caching. The trend in HPC systems, however, is toward large numbers of multicore processors with a meager amount of local RAM per core. Applications on these systems generally use the majority of this memory, leaving little room for caching. Furthermore, traditional techniques for maintaining coherence and recovering from failures were not designed for use at this scale.

In this paper we pursue a strategy of hiding latency and reducing I/O and messaging without using additional resources on clients. We describe five techniques for improving concurrent metadata and small-file I/O performance in parallel file systems: server-driven file precreation, the readdirplus POSIX extension, file stuffing, metadata commit coalescing, and eager data movement for reads and writes. Of these five techniques, the first two have been previously demonstrated in separate parallel file system implementations. The remaining three are also known optimizations, but we apply them in a novel way to the parallel file system environment. In this paper, all five are implemented in a single file system (PVFS) and tested in a consistent environment to assess their relative value. We also extend the analysis to evaluate behavior at an unprecedented scale on an IBM Blue Gene/P system.

The paper is organized as follows.
In Section II we describe the relevant aspects of PVFS. In Section III we discuss each of