An I/O Study of ECP Applications
Chen Wang (cwang@hdfgroup.org)
Elena Pourmal (epourmal@hdfgroup.org)

1 INTRODUCTION

We studied and analyzed the I/O patterns of four ECP applications and five HACC-IO benchmarks. Table 1 gives brief descriptions of those applications. In this paper, we describe in detail the steps of analyzing and tuning the HACC-IO benchmarks. We illustrate the impact of different access patterns, stripe settings, and HDF5 metadata. We also compare the five benchmarks on two different parallel file systems, Lustre and GPFS. We show that HDF5, with proper optimizations, can match the performance of the pure MPI-IO implementations.

Another goal of this paper is to understand the I/O behavior of ECP applications and to provide a systematic way to profile and tune I/O performance. We mainly used two I/O profiling tools, Darshan [3] and Recorder [6], to conduct this study. We also made suggestions for each application on how to avoid undesired behaviors and how to further improve I/O performance. In Section 4, we discuss our observations for the ECP applications.

1.1 Summary of observations and suggestions

We summarize below the observations and some unexpected behaviors we found for each application, along with suggestions on how to fix them. Detailed results and analysis can be found in Section 3 and Section 4.

• FLASH: Unnecessary HDF5 metadata operations H5Acreate(), H5Aopen(), and H5Aclose() are used during every checkpointing step. Those operations can be expensive, especially when running a large number of iterations. This can be easily fixed at the price of losing some code modularity.

• NWChem: File-per-process patterns are found for writing local temporary files. This is undesirable and will put a lot of pressure on parallel file systems for large-scale runs. Conflicting patterns are found for the runtime database file, which can lead to consistency issues when running on non-POSIX file systems.

• Chombo: The same file-per-process pattern is observed for Chombo too.
Moreover, Chombo by default uses independent I/O to write the final result to a shared HDF5 file. Depending on the problem scale and the underlying file system configuration, collective I/O can be enabled to further optimize I/O performance.

• QMCPACK: One unexpected pattern is found for checkpoint files. QMCPACK overwrites the same checkpoint file for each computation section. This can lead to an unrecoverable state if a failure occurs during the checkpointing step.

• HACC-IO: HDF5 can use different data layouts to achieve access patterns similar to those of MPI-IO. Stripe settings of the parallel file system have a big impact on write performance, and the default metadata header can greatly slow down writes. However, by carefully setting the alignment or the metadata block size, HDF5 can deliver performance similar to the pure MPI-IO implementation.

In this paper, we use the HACC-IO benchmarks as a detailed example to illustrate the process of analyzing and tuning I/O performance. In the next section, we first introduce the five HACC-IO benchmarks we created for this study and then describe the access patterns exhibited by each benchmark. In Section 3, we present the tuning parameters we explored and their impact on I/O performance.

2 HACC-IO BENCHMARKS

In this section, we describe the five benchmarks we created for this study and the three access patterns they exhibit. The same access patterns can be found in other scientific applications, so the general advice and tuning methodologies should apply to other applications as well.

In all benchmarks, all processors write 9 variables to a single shared file, and each variable has an identical size. Except for one benchmark (which we will discuss later), all variables are stored together in a one-dimensional array where each element is a double-precision floating-point value. The first two benchmarks are called MPI Contiguous and MPI Interleaved, and they are implemented using pure MPI-IO.
These two benchmarks serve as the baseline for comparison with the HDF5 implementations. As the names suggest, in the MPI Contiguous benchmark each processor writes each variable contiguously in the file, whereas in the MPI Interleaved benchmark each variable is written in an interleaved fashion. The code for writing variables is shown in Figure 1. The only difference between the two is how the offset of the next write is calculated: in the MPI Interleaved benchmark, the next write starts where the current write finished; in the MPI Contiguous benchmark, the next write starts at the current offset plus the variable size. Their access patterns are shown in Figure 4(a) and (b). In the MPI Interleaved benchmark, the nine writes are contiguous from the view of a local processor; from the perspective of each variable, however, the writes are interleaved (in fact, they are evenly strided). Also note that, for simplicity, Figure 1 shows only the code for independent I/O. Collective I/O is implemented using MPI_File_write_at_all instead of MPI_File_write_at.

The remaining three benchmarks are implemented using the HDF5 library, namely HDF5 Individual, HDF5 Multi, and HDF5 Compound. The HDF5 Individual benchmark uses the most common way to write multiple variables, with each variable stored as an individual dataset in the HDF5 representation. The resulting HDF5 file has one root group that contains nine separate datasets. The I/O part of the code is shown in Figure 2. This benchmark achieves the same access pattern as MPI Contiguous, as shown in Figure 4(a).