Small-File Access in Parallel File Systems

Philip Carns, Sam Lang, Robert Ross
Mathematics and Computer Science Division
Argonne National Laboratory
Argonne, IL 60439
{carns,slang,rross}@mcs.anl.gov

Murali Vilayannur
VMware Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
muraliv@vmware.com

Julian Kunkel, Thomas Ludwig
Institute of Computer Science
University of Heidelberg
{Julian.Kunkel,Thomas.Ludwig}@Informatik.uni-heidelberg.de

Abstract—Today's computational science demands have resulted in ever larger parallel computers, and storage systems have grown to match these demands. Parallel file systems used in this environment are increasingly specialized to extract the highest possible performance for large I/O operations, at the expense of other potential workloads. While some applications have adapted to I/O best practices and can obtain good performance on these systems, the natural I/O patterns of many applications result in generation of many small files. These applications are not well served by current parallel file systems at very large scale.

This paper describes five techniques for optimizing small-file access in parallel file systems for very large scale systems. These five techniques are all implemented in a single parallel file system (PVFS) and then systematically assessed on two test platforms. A microbenchmark and the mdtest benchmark are used to evaluate the optimizations at an unprecedented scale. We observe as much as a 905% improvement in small-file create rates, 1,106% improvement in small-file stat rates, and 727% improvement in small-file removal rates, compared to a baseline PVFS configuration on a leadership computing platform using 16,384 cores.

I. INTRODUCTION

Today's computational science demands have resulted in ever larger parallel computers, and storage systems for these computers have likewise grown to match the rates at which applications generate data.
Parallel file systems used in this environment have become increasingly specialized in an attempt to extract the best possible performance from underlying storage hardware for computational science application workloads. These specialized systems excel at large and aligned concurrent access, and some applications have recognized that performing large accesses to multi-gigabyte files is the best way to leverage parallel file systems. Other applications continue to use other I/O strategies, with varying degrees of success. Meanwhile, scientists in new domains are beginning to use high-performance computing (HPC) resources to attack problems in their areas of expertise, and these applications bring new I/O demands.

The results can be seen in recent workload studies. In practice, many HPC storage systems are used to store many small files in addition to the large ones. For example, a 2007 study of a shared parallel file system at the National Energy Research Scientific Computing Center showed that it contained over 13 million files, 99% of which were under 64 MBytes and 43% of which were under 64 KBytes [1]. A similar 2007 study at the Pacific Northwest National Laboratory showed that of the 12 million files on that system, 94% of files were under 64 MBytes and 58% were under 64 KBytes [2].

Further investigation finds that these files come from a number of sources, not just one misbehaving application. Several scientific domains such as climatology, astronomy, and biology generate data sets that are most conveniently stored and organized on a file system as independent files.
The following are examples of data sets from each field (respectively):

- 450,000 Community Climate System Model files with an average size of 61 MBytes [3]
- 20 million images hosted by the Sloan Digital Sky Survey with an average size of less than 1 MByte [4]
- up to 30 million files averaging 190 KBytes generated by sequencing the human genome [5]

Accessing a large number of small files on a parallel file system shifts the I/O challenge from providing high aggregate I/O throughput to supporting highly concurrent metadata access rates. The most common technique currently used to improve metadata rates in file systems is client-side caching. The trend in HPC systems, however, is toward large numbers of multicore processors with a meager amount of local RAM per core. Applications on these systems generally use the majority of this memory, leaving little room for caching. Furthermore, traditional techniques for maintaining coherence and recovering from failures were not designed for use at this scale.

In this paper we pursue a strategy of hiding latency and reducing I/O and messaging without using additional resources on clients. We describe five techniques for improving concurrent metadata and small-file I/O performance in parallel file systems: server-driven file precreation, the readdirplus POSIX extension, file stuffing, metadata commit coalescing, and eager data movement for reads and writes. Of these five techniques, the first two have been previously demonstrated in separate parallel file system implementations. The remaining three are also known optimizations, but we apply them in a novel way to the parallel file system environment. In this paper, all five are implemented in a single file system (PVFS) and tested in a consistent environment to assess their relative value. We also extend the analysis to evaluate behavior at an unprecedented scale on an IBM Blue Gene/P system.

The paper is organized as follows.
In Section II we describe the relevant aspects of PVFS. In Section III we discuss each of