An I/O Study of ECP Applications
Chen Wang (cwang@hdfgroup.org)
Elena Pourmal (epourmal@hdfgroup.org)

1 INTRODUCTION

We studied and analyzed the I/O patterns of four ECP applications and five HACC-IO benchmarks. Table 1 gives brief descriptions of those applications. In this paper, we describe in detail the steps of analyzing and tuning the HACC-IO benchmarks. We illustrate the impact of different access patterns, stripe settings, and HDF5 metadata. We also compare the five benchmarks on two different parallel file systems, Lustre and GPFS. We show that HDF5, with proper optimizations, can match the performance of the pure MPI-IO implementations.

Another goal of this paper is to understand the I/O behavior of ECP applications and to provide a systematic way to profile and tune I/O performance. We mainly used two I/O profiling tools, Darshan [3] and Recorder [6], to conduct this study. We also made suggestions for each application on how to avoid undesired behaviors and how to further improve I/O performance. In Section 4, we discuss our observations for the ECP applications.

1.1 Summary of observations and suggestions

We summarize below the observations and some unexpected behaviors we found for each application, along with suggestions on how to fix them. Detailed results and analysis can be found in Section 3 and Section 4.

• FLASH: Unnecessary HDF5 metadata operations H5Acreate(), H5Aopen(), and H5Aclose() are used during every checkpointing step. Those operations can be expensive, especially when running a large number of iterations. This can be easily fixed at the price of losing some code modularity.

• NWChem: File-per-process patterns are found for writing local temporary files. This is undesirable and will put a lot of pressure on parallel file systems for large-scale runs. Conflicting patterns are found for the runtime database file, which can lead to consistency issues when running on non-POSIX file systems.

• Chombo: The same file-per-process pattern is observed for Chombo too.
Moreover, Chombo by default uses independent I/O to write the final result to a shared HDF5 file. Depending on the problem scale and the underlying file system configuration, collective I/O can be enabled to further optimize I/O performance.

• QMCPACK: One unexpected pattern is found for checkpoint files. QMCPACK overwrites the same checkpoint file for each computation section. This can lead to an unrecoverable state if a failure occurs during the checkpointing step.

• HACC-IO: HDF5 can use different data layouts to achieve access patterns similar to those of MPI-IO. Stripe settings of the parallel file system have a big impact on write performance, and the default metadata header can greatly slow down writes. However, by carefully setting the alignment or the metadata block size, HDF5 can deliver performance similar to the pure MPI-IO implementation.

In this paper, we use the HACC-IO benchmarks as a detailed example to illustrate the process of analyzing and tuning I/O performance. In the next section, we first introduce the five HACC-IO benchmarks we created for this study and then describe the access patterns exhibited by each benchmark. In Section 3, we present the tuning parameters we explored and their impact on I/O performance.

2 HACC-IO BENCHMARKS

In this section, we describe the five benchmarks we created for this study and the three access patterns they exhibit. The same access patterns can be found in other scientific applications, so the general advice and tuning methodologies should apply to other applications as well.

In all benchmarks, all processors write 9 variables to a single shared file, and each variable has an identical size. Except for one benchmark (which we will discuss later), all variables are stored together in a one-dimensional array where each element is a double-precision floating-point value. The first two benchmarks are called MPI Contiguous and MPI Interleaved, and they are implemented using pure MPI-IO.
These two benchmarks serve as the baseline for comparison with the HDF5 implementations. As the names suggest, in the MPI Contiguous benchmark each processor writes each variable contiguously in the file, whereas in the MPI Interleaved benchmark each variable is written in an interleaved fashion. The code for writing variables is shown in Figure 1. The only difference between the two is how the offset of the next write is calculated: in the MPI Interleaved benchmark, the next write starts where the current write finished; in the MPI Contiguous benchmark, the next write starts at the current offset plus the variable size. Their access patterns are shown in Figure 4(a) and (b). In the MPI Interleaved benchmark, the nine writes are contiguous from the view of a local processor; from the perspective of each variable, however, the writes are interleaved (in fact, they are evenly strided). Also note that, for simplicity, Figure 1 shows only the code for independent I/O. Collective I/O is implemented using MPI_File_write_at_all instead of MPI_File_write_at.

The remaining three benchmarks are implemented using the HDF5 library, namely HDF5 Individual, HDF5 Multi, and HDF5 Compound. The HDF5 Individual benchmark uses the most common way to write multiple variables, with each variable stored as an individual dataset in the HDF5 representation. The resulting HDF5 file has one root group that contains nine separate datasets. The I/O part of the code is shown in Figure 2. This benchmark achieves the same access pattern as MPI Contiguous, as shown in Figure 4(a).