U.S. Department of Energy Best Practices Workshop on File Systems & Archives:* Usability at Los Alamos National Lab†

John Bent, Los Alamos National Lab, johnbent@lanl.gov
Gary Grider, Los Alamos National Lab, ggrider@lanl.gov

Abstract

There yet exist no truly parallel file systems. Those that make the claim fall short when it comes to providing adequate concurrent write performance at large scale. This limitation causes large usability headaches in HPC computing.

Users need two major capabilities missing from current parallel file systems. One, they need low-latency interactivity. Two, they need high bandwidth for large parallel IO; this capability must be resistant to IO patterns and should not require tuning. No existing parallel file system provides these features. Frighteningly, exascale renders these features even less attainable from currently available parallel file systems. Fortunately, there is a path forward.

1 Introduction

High-performance computing (HPC) requires a tremendous amount of storage bandwidth. As computational scientists push for ever more computational capability, system designers accommodate them with increasingly powerful supercomputers. The challenge of the last few decades has been that the performance of individual components such as processors and hard drives has remained relatively flat. Thus, building more powerful supercomputers requires that they be built with increasing numbers of components. Problematically, the mean time to failure (MTTF) of individual components has also remained relatively flat over time. Thus, the larger the system, the more frequent the failures. Traditionally, failures have been dealt with by periodically saving computational state onto persistent storage and then recovering from this state following any failure (checkpoint-restart).
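The checkpoint-restart cycle described above can be sketched as a small simulation. This is an illustrative sketch only, not the paper's model: the exponential failure distribution, the function name, and all parameters are assumptions made here for clarity.

```python
import random

def simulate_goodput(total_work, checkpoint_interval, checkpoint_cost,
                     mttf, seed=0):
    """Simulate checkpoint-restart under random failures.

    Returns goodput: useful compute time divided by total wall-clock
    time. Failures are drawn from an exponential distribution with the
    given mean time to failure (MTTF); after a failure, the job resumes
    from the last completed checkpoint, losing any work since then.
    """
    rng = random.Random(seed)
    wall_time = 0.0
    done = 0.0  # work completed and safely checkpointed
    while done < total_work:
        next_failure = rng.expovariate(1.0 / mttf)
        segment = min(checkpoint_interval, total_work - done)
        if next_failure >= segment + checkpoint_cost:
            # The segment and its checkpoint both finish before failing.
            wall_time += segment + checkpoint_cost
            done += segment
        else:
            # Failure: all progress since the last checkpoint is lost.
            wall_time += next_failure
    return done / wall_time

# With frequent failures, time lost to re-work and checkpointing
# overhead reduces goodput below 1.0.
print(simulate_goodput(1000, checkpoint_interval=10,
                       checkpoint_cost=1, mttf=100))
```

The sketch makes the designer's trade-off concrete: a shorter checkpoint interval loses less work per failure but pays the checkpoint cost more often, which is why an optimal checkpointing frequency exists.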
* San Francisco, CA; September 26-27, 2011
† LANL Release LA-UR 11-11416

The utilization of systems is then measured using goodput, which is the percentage of computer time that is spent actually making progress towards the completion of the job. The goal of system designers is therefore to maximize goodput in the face of random failures using an optimal frequency of checkpointing.

Determining checkpointing frequency should be straightforward: measure the MTTF, measure the amount of data to be checkpointed, measure the available storage bandwidth, compute the checkpoint time, and plug these into a simple formula [3]. However, measuring available storage bandwidth is not as straightforward as one would hope. Ideally, parallel file systems could achieve some consistent percentage of the hardware capabilities; for example, a reasonable goal for a parallel file system using disk drives for storage would be to achieve 70% of the aggregate disk bandwidth. If this were the case, then a system designer could simply purchase the necessary amount of storage hardware to gain sufficient performance to minimize checkpoint time and maximize system goodput. However, no currently available parallel file system can provide any such performance level consistently.

2 Challenges

Unfortunately, while some IO patterns can achieve a consistent percentage of the storage capability, many cannot. Instead, these IO patterns achieve a consistently low performance such that their percentage of hardware capability diminishes as more hardware is added! For example, refer to Figures 1a, 1b, and 1c, which show that writing to a shared file, N-1, achieves consistently poor performance across the three major parallel file systems, whereas the bandwidth of writing to unique files, N-N, scales as desired with the size of the job.
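The two access patterns compared in the figures can be sketched as follows. This single-process illustration (all file names and sizes are assumptions) shows only the shape of each pattern, not its performance: in N-1, every rank writes to a strided offset within one shared file; in N-N, every rank writes its own file.

```python
import os
import tempfile

BLOCK = 4096   # bytes written by each simulated rank
NPROCS = 4     # number of simulated ranks

def write_n_to_1(directory):
    """N-1: all ranks write into one shared file at rank-determined
    offsets. On a real machine these writes arrive concurrently,
    which is the case parallel file systems handle poorly."""
    path = os.path.join(directory, "shared.ckpt")
    with open(path, "wb") as f:
        for rank in range(NPROCS):
            f.seek(rank * BLOCK)          # strided offset per rank
            f.write(bytes([rank]) * BLOCK)
    return path

def write_n_to_n(directory):
    """N-N: each rank writes its own unique file, avoiding
    write-sharing entirely."""
    paths = []
    for rank in range(NPROCS):
        path = os.path.join(directory, f"rank{rank}.ckpt")
        with open(path, "wb") as f:
            f.write(bytes([rank]) * BLOCK)
        paths.append(path)
    return paths

with tempfile.TemporaryDirectory() as d:
    shared = write_n_to_1(d)
    assert os.path.getsize(shared) == NPROCS * BLOCK
    per_rank = write_n_to_n(d)
    assert len(per_rank) == NPROCS
```

Both patterns move the same number of bytes; the difference lies entirely in how many writers share a single file, which is why N-N scales with job size while N-1 does not.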
The flat lines for the N-1 workloads show that there is no amount of storage hardware that can be purchased to fix them: regardless of system size, the bandwidths remain flat. The hardware is not at fault; the performance flaw is within the parallel file systems, which cannot sustain massively concurrent writes while maintaining performance. The challenge is due