Characterizing Output Bottlenecks in a Supercomputer

Bing Xie*, Jeffrey Chase*, David Dillow†, Oleg Drokin‡, Scott Klasky†, Sarp Oral†, Norbert Podhorszki†

* Duke University, Durham, NC 27708. Email: {bingxie, chase}@cs.duke.edu
† Oak Ridge National Laboratory, Oak Ridge, TN 37831. Email: {dillowda, oralhs, klasky, pnorbert}@ornl.gov
‡ Intel Corporation, Knoxville, TN 37919. Email: oleg.drokin@intel.com

Abstract—Supercomputer I/O loads are often dominated by writes. HPC (High Performance Computing) file systems are designed to absorb these bursty outputs at high bandwidth through massive parallelism. However, the delivered write bandwidth often falls well below the peak. This paper characterizes the data absorption behavior of a center-wide shared Lustre parallel file system on the Jaguar supercomputer. We use a statistical methodology to address the challenges of accurately measuring a shared machine under production load and to obtain the distribution of bandwidth across samples of compute nodes, storage targets, and time intervals. We observe and quantify limitations from competing traffic, contention on storage servers and I/O routers, concurrency limitations in the client compute node operating systems, and the impact of variance (stragglers) on coupled output such as striping. We then examine the implications of our results for application performance and the design of I/O middleware systems on shared supercomputers.

I. INTRODUCTION

Output performance is crucial to harnessing the computational power of supercomputers. Some HPC applications [1], [2], [3], [4] run on the scale of hundreds of thousands of compute cores and produce terabyte-scale output bursts for intermediate results and checkpointing or restart files (defensive I/O). If the I/O system does not absorb the output fast enough, then the memory available to buffer the output is exhausted, forcing the computation to stall before it can output more data.
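As a rough illustration of this effect (our sketch, not a model from the paper), consider a steady-state view of asynchronous output: each iteration the application emits a burst that drains at the delivered file-system bandwidth, overlapped with the next compute phase; whatever cannot drain in time becomes stall time.

```python
def stall_per_iteration(burst_gib: float, bw_gibs: float, compute_s: float) -> float:
    """Seconds an application stalls per iteration in steady state.

    Simplified model (our assumption): asynchronous writes drain a burst of
    `burst_gib` GiB at `bw_gibs` GiB/s, overlapped with `compute_s` seconds
    of computation. Residue that cannot drain in time forces a stall.
    """
    drain_time = burst_gib / bw_gibs          # time to absorb one burst
    return max(0.0, drain_time - compute_s)   # overlap hides up to compute_s

# A 100 GiB burst at 10 GiB/s delivered bandwidth needs 10 s to drain;
# with only 5 s of compute to overlap, the application stalls for 5 s.
print(stall_per_iteration(100, 10, 5))   # -> 5.0
print(stall_per_iteration(40, 10, 5))    # -> 0.0 (burst fully hidden)
```

The model makes the paper's point directly: shrinking delivered bandwidth below the application's output rate converts CPU time into stall time, regardless of buffering.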
Output stalls leave precious CPU resources underutilized, extending application runtime and compromising system throughput. We find that output stalls are often observed in practice, even with asynchronous writes.

One way to reduce output stalls is to add more memory and disk spindles. But these hardware resources are expensive, and supercomputers are designed with a careful balance of I/O and computational capabilities. By the classical Amdahl's rule, a balanced petaflop facility requires 128 TB/s of I/O bandwidth. Technology planning for cost-effective deployments has used a more austere baseline of 2 TB/s per petaflop [5], and some systems are designed with even lower ratios. As a result, output bandwidth is a precious resource in supercomputers, and trends suggest that this limitation is not likely to change. It is therefore crucial for software to make efficient use of the bandwidth. In principle, large write bursts can stream effectively and achieve full bandwidth. In practice, delivered bandwidth is highly sensitive to the application's use of storage APIs and its data layout, placing an unwelcome burden on domain scientists to manage I/O performance tradeoffs at the application level. This problem has motivated development of adaptive I/O middleware systems, such as ADIOS [6], [7], [8], to present a uniform API to applications and adapt their I/O patterns to the underlying storage system.

This paper characterizes output burst absorption on Jaguar, a 2.33 petaflop Cray XK6 housed at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL). Storage for Jaguar is provided by Spider [9], the 10 petabyte, 240 GB/s Lustre [10] file system at OLCF. The key contribution of our study is to enhance understanding of performance behaviors for state-of-the-art software as currently deployed in a leadership-class facility.
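The 128 TB/s balance figure quoted above is consistent with Amdahl's classical rule of one bit of I/O per instruction per second, if a petaflop is read with a binary prefix (2^50 operations per second) and bandwidth in binary (TiB/s) units. This derivation is our reconstruction; the text does not spell it out.

```python
# Assumptions (ours): binary-prefix petaflop and terabyte.
FLOPS_PER_PETAFLOP = 2 ** 50   # "petaflop" as 2^50 operations/s
BITS_PER_FLOP = 1              # Amdahl's rule: 1 bit of I/O per instruction/s
BYTES_PER_TIB = 2 ** 40

io_bytes_per_s = FLOPS_PER_PETAFLOP * BITS_PER_FLOP // 8   # bits -> bytes
print(io_bytes_per_s // BYTES_PER_TIB)                     # -> 128
```

With decimal prefixes (10^15 FLOPS, 10^12-byte TB) the same rule gives 125 TB/s; either way, the balanced figure is roughly two orders of magnitude above the 2 TB/s per petaflop planning baseline cited in [5].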
One purpose of our study is to inform ongoing development of integrated software stacks for parallel storage, including parallel file systems and I/O middleware systems such as ADIOS. In particular, our study is an important step toward quantitative models of storage system performance behaviors for use by I/O middleware systems. Models can guide choices made at the middleware layer, including dynamic adaptation to "cross-traffic" from competing workloads on shared supercomputers.

We use a sampling methodology to address the challenges of benchmarking a shared supercomputer. At the time of our study, Jaguar was the third-fastest disclosed supercomputer in the world, serving multiple user communities. We are unable to reserve it for exclusive use or to replace any part of its system software. Instead, we use various configurations of the IOR benchmark [11] to focus traffic on specific stages of the multi-stage write path, and we analyze distributions of saturation bandwidths across multiple sample trials in different parts of the machine and at different times. These techniques allow us to characterize output burst absorption under production load without exclusive access to the machine.

SC12, November 10-16, 2012, Salt Lake City, Utah, USA. 978-1-4673-0806-9/12/$31.00 (c) 2012 IEEE
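To make the sampling methodology concrete, the sketch below (our illustration with made-up numbers, not the paper's analysis pipeline) summarizes per-trial saturation bandwidths with order statistics, so that stragglers and cross-traffic show up as a long lower tail rather than skewing a single mean.

```python
import statistics

# Hypothetical saturation bandwidths (GiB/s), one per sample trial taken on
# different compute-node/storage-target sets at different times.
samples = [5.1, 4.8, 5.0, 3.2, 4.9, 5.2, 2.7, 5.0, 4.6, 5.1]

median = statistics.median(samples)
q1, _, q3 = statistics.quantiles(samples, n=4)   # quartile cut points

print(f"median bandwidth: {median:.2f} GiB/s")
print(f"interquartile range: [{q1:.2f}, {q3:.2f}] GiB/s")
# The low outliers (2.7, 3.2) mimic trials degraded by competing traffic:
# they pull down the mean but leave the median nearly unchanged, which is
# why distributional summaries suit a shared production machine.
```

Reporting the full distribution across sampled node sets and time intervals, rather than one aggregate number, is what lets the study separate systematic bottlenecks from transient interference.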