COMMUNICATION OPTIMIZATION FOR IO STAGING

Tuan Anh Nguyen, Georgia Institute of Technology, Atlanta, GA 3032, tuananh@cc.gatech.edu
Fang Zheng, Georgia Institute of Technology, Atlanta, GA 3032, fzheng@cc.gatech.edu
Hasan Abbasi, Oak Ridge National Lab, Oak Ridge, TN 37831, habbasi@ornl.gov
Matthew Wolf, Georgia Institute of Technology, Atlanta, GA 3032, mwolf@cc.gatech.edu
Karsten Schwan, Georgia Institute of Technology, Atlanta, GA 3032, karsten.schwan@cc.gatech.edu

ABSTRACT
IO staging is an IO augmentation service that reduces IO cost. It moves data from the simulation to staging nodes that are dedicated exclusively to custom IO data processing. Such data movement can be time-consuming, particularly with high-end codes that can produce hundreds of gigabytes of data every time step, e.g., every few minutes. As a result, when such data movements are done in the background and asynchronously with a simulation’s execution, there is a high likelihood that they coincide with application-level communications, such as the simulation’s collective operations, and therefore perturb them. Previous work has suggested explicitly scheduling data movements in ways that take application-level behavior into account, but that requires careful application profiling and/or monitoring. This paper uses an alternative approach in which (i) IO staging data movement is adaptively accelerated by selective use of different degrees of parallelism, (ii) adaptation is controlled by parameters that vary the probability of conflicts between concurrent application communication and data-movement communication, and (iii) runtime monitoring is used to assess conflict occurrence.

Categories and Subject Descriptors
H.5 [I/O, storage systems, and data management]: Petascale IO; H.7 [Performance modelling and analysis]: Metrics (complexity measures, performance measures)

General Terms
IO, performance, experimental

Keywords
IO staging, bandwidth, message pipelining, memory bonding, derivative-free, portals

1.
INTRODUCTION
Large-scale simulations produce or consume hundreds of GB of data every few minutes. Some peta-scale applications, such as Fusion [8], generate terabytes of data in a single run. As applications scale to tens of thousands of cores, the cost of IO operations rises to untenable levels and limits the scalability of peta-scale applications. In some situations, reading data from IO can consume up to 98% of total runtime [4]. Moreover, IO operations also consume substantial energy [10]. Reducing IO cost has therefore become an important issue.

IO staging has emerged as an efficient way to reduce IO cost. IO staging [2] [1] [15] separates computation domains from IO service domains by using dedicated nodes to perform IO augmentation services such as data reorganization, filtering, or analysis. Moreover, IO staging uses asynchronous transfer to move data from compute nodes to staging nodes, processes the data at the staging nodes, and then moves the data to IO. By using asynchronous data transfer, IO staging allows the simulation to overlap IO service with computation. This significantly reduces the overhead of output data handling [2].

The ratio of compute nodes to staging nodes usually ranges from 32:1 to 128:1. Therefore, receiving all data from the compute nodes at the staging nodes can be very costly, especially when the IO staging service has O(N) complexity. Moreover, asynchronous data movement to the staging nodes is done in the background, so it interferes with application communication if it occurs while the simulation is performing collective communication.

To improve the efficiency of IO staging, we need to reduce (i) the probability that IO staging data movement overlaps with application communication and (ii) the amount of time for which it does so. This paper presents a new way to solve these problems without requiring explicit knowledge about application behavior.
In general, we solve these problems by accelerating data transfer and by iteratively acquiring the application's communication pattern. Accelerating data transfer reduces the probability that IO staging data movement perturbs application communication. By iteratively acquiring the application's communication pattern, we are able to predict and decide when to move data so as to minimize the amount of time during which communication conflicts occur. More specifically, this paper presents methods to:
1. Reduce the probability of communication perturbation