Pattern-Direct and Layout-Aware Replication Scheme for Parallel I/O Systems

Yanlong Yin, Jibing Li, Jun He, Xian-He Sun, and Rajeev Thakur
Computer Science Department, Illinois Institute of Technology, Chicago, Illinois 60616
Email: {yyin2, jli33, jhe24, sun}@iit.edu
Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, Illinois 60439
Email: thakur@mcs.anl.gov

Abstract—The performance gap between computing power and the I/O system is ever increasing, and in the meantime more and more High Performance Computing (HPC) applications are becoming data intensive. This study describes an I/O data replication scheme, named the Pattern-Direct and Layout-Aware (PDLA) data replication scheme, to alleviate this performance gap. The basic idea of PDLA is to replicate identified data access patterns and to save these reorganized replicas with optimized data layouts based on access cost analysis. A runtime system is designed and developed to integrate the PDLA replication scheme with existing parallel I/O systems; a prototype of PDLA is implemented under the MPICH2 and PVFS2 environments. Experimental results show that PDLA is effective in improving the data access performance of parallel I/O systems.

Keywords—Parallel I/O; I/O optimization; data replication; data reorganization; data access pattern

I. INTRODUCTION

During the last several decades, the rapid development of semiconductor technology has allowed processor speed to increase exponentially. Supercomputers are moving from petascale toward exascale in the coming decade. However, the development of data input/output (I/O) systems and storage devices has not kept pace with that of computing power. As many believe, this biased technology advance will continue in the near future. This unbalanced advance leads to the so-called I/O-wall problem.
In the meantime, large-scale scientific applications grow continuously in terms of data access intensity, imposing a greater workload on the I/O and storage subsystems. This trend puts even more pressure on already saturated I/O systems. For instance, in astronomy, giant radio telescopes capture observation images continuously, and the captured data are saved into storage systems. Data analysis applications, such as Montage [1] developed by NASA, then read the data out of the storage systems and analyze them. The telescopes may generate data at a rate of many gigabytes or even petabytes per second, and the data analysis is both computation intensive and data intensive [2].

Relatively slow storage devices compounded with data intensive applications make the I/O system the primary performance bottleneck in many HPC systems. This drawback motivates this study, which aims to alleviate the I/O bottleneck, especially for data intensive applications.

I/O has been a hot topic in recent years. Many I/O optimization techniques have been developed, such as data sieving [3], List I/O [4], DataType I/O [5], and Collective I/O [3] [6]. Some systems also integrate new layers or middleware into the parallel I/O software stack. All these layers and optimization techniques make the parallel I/O system exceedingly complex. How to optimize I/O performance is elusive, and the optimization is a complex, error-prone, and time-consuming task, especially for applications with complex I/O behaviors. For example, Zhang's work [7] shows that Collective I/O works well in some cases but not in others. Song's work [8] shows that finding the optimal data layout configuration in PVFS2 can be a daunting task. Their work further confirms our belief that I/O performance is application dependent and that a general I/O system needs to be adjustable to different applications [9].
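To make the data sieving optimization mentioned above concrete, the following sketch models its core idea: instead of issuing one small read per noncontiguous region, the library reads a single large contiguous extent covering all requested regions and extracts the needed pieces in memory. This is a minimal Python model of the concept only, not the actual MPI-IO (ROMIO) implementation; the function name and interface are hypothetical.

```python
import io

def data_sieving_read(f, requests):
    """Serve noncontiguous read requests with one contiguous read.

    f: a seekable binary file-like object.
    requests: list of (offset, length) tuples.
    Returns the requested byte ranges, in request order.
    """
    start = min(off for off, _ in requests)
    end = max(off + ln for off, ln in requests)
    f.seek(start)
    buf = f.read(end - start)          # one large contiguous read
    # Extract each requested region from the in-memory buffer.
    return [buf[off - start : off - start + ln] for off, ln in requests]

# Example: three small holes served by a single contiguous read.
f = io.BytesIO(bytes(range(256)) * 4)   # a 1024-byte "file"
pieces = data_sieving_read(f, [(0, 4), (100, 4), (500, 4)])
```

The trade-off, as in real data sieving, is that the contiguous read may fetch unneeded "hole" bytes; it pays off when the cost of many small I/O operations exceeds the cost of the extra data transferred.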
This leads to a must-have property of our solution: the I/O optimization should take the application's and system's characteristics into consideration and be adaptive to different applications. To achieve the goal of alleviating the I/O bottleneck and to satisfy the requirement of adaptability, we design and implement the Pattern-Direct and Layout-Aware (PDLA) replication scheme for parallel I/O systems. We design PDLA based on the following facts.

1) Contiguous data access is preferable. The performance of contiguous data access is higher than that of noncontiguous data access. This stays true for both hard disk drives (HDD) and solid state disks (SSD) [10].

2) Data layout matters. The data layout in a parallel file system can largely influence I/O performance. Modern parallel file systems support multiple data layout policies: users can choose to distribute data on one single storage node, on a set of nodes, or on all available nodes. Previous work [8] shows that, for applications with different data access patterns, the optimal data layouts differ. The optimal data layout yields the

2013 IEEE 27th International Symposium on Parallel & Distributed Processing. 1530-2075/13 $26.00 © 2013 IEEE. DOI 10.1109/IPDPS.2013.114
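The layout discussion above can be illustrated with a small sketch of the common round-robin (simple striping) policy found in parallel file systems such as PVFS2: a file is cut into fixed-size stripes that are distributed across storage servers in turn, so any file offset maps to one server and a local offset on that server. This function is a hypothetical illustration of the mapping, not PVFS2's actual code, and the parameter names are our own.

```python
def locate(offset, stripe_size, num_servers, first_server=0):
    """Map a file offset to (server index, offset within that server's file)
    under a round-robin striping layout.

    stripe_size: bytes per stripe; num_servers: servers the file spans;
    first_server: server holding stripe 0.
    """
    stripe_index = offset // stripe_size
    server = (first_server + stripe_index) % num_servers
    # Each full round of num_servers stripes adds one stripe per server.
    server_offset = (stripe_index // num_servers) * stripe_size \
                    + offset % stripe_size
    return server, server_offset

# Example: 64 KB stripes over 4 servers.
print(locate(0, 65536, 4))        # start of the file, server 0
print(locate(65540, 65536, 4))    # 4 bytes into the second stripe
```

Changing `num_servers` from 1 to all available nodes is exactly the layout choice described above; which setting is optimal depends on the application's request sizes and concurrency, which is why a single fixed layout cannot serve all access patterns.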