Understanding I/O Behavior of Scientific Deep Learning Applications in HPC Systems

Hariharan Devarajan, Illinois Institute of Technology, hdevarajan@hawk.iit.edu
Huihuo Zheng, Argonne National Laboratory, huihuo.zheng@anl.gov
Xian-He Sun, Illinois Institute of Technology, sun@iit.edu
Venkatram Vishwanath, Argonne National Laboratory, venkatv@alcf.anl.gov

I. EXTENDED ABSTRACT

In the past decade, deep learning (DL) has been applied to a wide range of applications [1], [2], [3] to achieve unprecedented results. These include image recognition [4], natural language processing [5], and even autonomous driving [6], as well as physical science domains such as cosmology [7], materials science [8], [9], and biology [10], [11]. DL methods iteratively adjust the weights within a network to minimize a loss function. At each training step, the application reads a mini-batch of data, computes the gradient of the loss function, and then updates the weights of the network using stochastic gradient descent. Many new AI hardware platforms (e.g., GPUs, TPUs, Cerebras, etc.) have been designed and deployed to accelerate the computation during training. However, as the size and complexity of datasets grow rapidly, DL training becomes increasingly read-intensive, and I/O is a potential bottleneck in DL applications [12]. At the same time, more and more scientific DL studies are performed on high-performance supercomputers through distributed training frameworks to reduce the training time-to-solution [13]. Therefore, characterizing the I/O behavior of DL workloads in high-performance computing (HPC) environments is crucial for addressing potential I/O bottlenecks and for providing useful guidance on performing efficient parallel I/O.

In this study, we aim to understand the I/O behavior of scientific DL applications. As a starting point, we explore a collection of scientific deep learning workloads currently running at the Argonne Leadership Computing Facility (ALCF).
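As a concrete illustration, the per-step loop described above (read a mini-batch, compute the gradient of the loss, update the weights) can be sketched in plain Python for a one-parameter least-squares model. The toy dataset, learning rate, and batch size are illustrative assumptions, not taken from any of the profiled workloads.

```python
# Minimal sketch of the SGD training loop: each step reads a
# mini-batch, computes the gradient of the loss, and updates the
# weight. Toy model: y = w * x, loss = mean squared error.
data = [(x, 3.0 * x) for x in range(1, 101)]  # toy dataset, true w = 3
w = 0.0            # initial weight
lr = 1e-4          # learning rate (illustrative)
batch_size = 10

for step in range(200):
    start = (step * batch_size) % len(data)
    batch = data[start:start + batch_size]  # "read" one mini-batch
    # dL/dw for L = (1/B) * sum (w*x - y)^2 is (2/B) * sum (w*x - y) * x
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                          # stochastic gradient descent update

print(round(w, 2))  # → 3.0
```

In a real DL workload the "read one mini-batch" step is where the I/O pressure studied in this paper arises: each training step issues reads against the shared file system, so the loop is read-intensive by construction.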
These workloads are selected from various projects, such as the Argonne Data Science Program (ADSP), the Aurora Early Science Program (ESP), and the Exascale Computing Project (ECP). The science domains represented by the workloads include neutrino physics [7], cosmology [14], materials science [8], computational physics [15], and biology [16], [10]. Many of the workloads are in active development targeting the upcoming exascale supercomputers. One of the long-term goals of this study is to identify existing I/O bottlenecks in these workloads on current production machines and to suggest I/O optimizations both for current applications and as we develop them for future systems.

We profile the I/O behavior of eight DL applications on Theta, our current production leadership supercomputer at ALCF. To realize this, we utilize the profilers provided by the DL frameworks, such as the TensorFlow profiler, as well as low-level I/O profilers such as Darshan, to study the I/O behavior of applications on supercomputers. These profilers are accompanied by their own analysis tools. However, to obtain a holistic view of an application, we developed a Python library, VaniDL, for integrating and post-processing the information obtained from the profiling tools and generating a high-level I/O summary of the application.

The main contributions of this work are:
1) proposing a systematic framework for I/O profiling of DL workloads and developing an analyzer tool, VaniDL, which provides insights into the I/O behavior of DL applications;
2) a preliminary exploration of the I/O behavior of eight scientific DL applications on a leadership supercomputer.

II. I/O BEHAVIOR OF HPC DEEP LEARNING WORKLOADS

A. Methodology

Applications: We target distributed DL workloads.
These include Neutrino and Cosmic Tagging with UNet [7], Distributed Flood-Filling Networks (FFN) for shape recognition in brain tissue [8], Deep Learning Climate Segmentation [17], CosmoFlow for learning the universe at scale [14], the Cancer Distributed Learning Environment (CANDLE) for cancer research [10], Fusion Recurrent Neural Net for representation learning in plasma science [16], Learning to Hamiltonian Monte Carlo (L2HMC) [15], and the TensorFlow CNN Benchmarks [18]. These applications are implemented in TensorFlow and use Horovod for data-parallel training. Some of them also have PyTorch implementations.

Hardware: We run the applications on Theta. This supercomputer consists of more than 3600 nodes and 864 Aries routers interconnected with a dragonfly network. Each router hosts four second-generation Intel Xeon Phi processors, code-named Knights Landing (KNL). Each node is equipped with 192 GB of DDR4 and 16 GB of MCDRAM. In all the studies presented here, we use 2 hyper-threads per core for a total of 128 threads per node, and four processes per node. The datasets are stored on a Lustre file system. We set the Lustre stripe size to 1 MB and the stripe count to 48.

Tools: We use Darshan (with extended tracing) as our low-level I/O profiling tool, along with the TensorFlow profiler. Additionally, we process the profiling results using our custom analysis tool, VaniDL [19], to integrate the low-level Darshan logs with the high-level TensorFlow profiler logs and generate a high-level I/O summary of the application.
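To give a flavor of the kind of post-processing involved, the sketch below rolls hypothetical per-rank I/O trace records up into a per-file, per-operation summary. The record fields and file names are illustrative assumptions only; they are not Darshan's actual log format and this is not VaniDL's actual API.

```python
from collections import defaultdict

# Hypothetical per-rank I/O trace records in the spirit of Darshan
# extended tracing: (rank, file, operation, bytes, duration_s).
# Field layout and values are illustrative, not Darshan's real format.
records = [
    (0, "train.h5", "read", 1 << 20, 0.004),
    (0, "train.h5", "read", 1 << 20, 0.005),
    (1, "train.h5", "read", 1 << 20, 0.004),
    (1, "ckpt.h5", "write", 4 << 20, 0.020),
]

def summarize(records):
    """Aggregate low-level records into a per-(file, op) summary."""
    summary = defaultdict(lambda: {"ops": 0, "bytes": 0, "time_s": 0.0})
    for rank, fname, op, nbytes, dur in records:
        entry = summary[(fname, op)]
        entry["ops"] += 1
        entry["bytes"] += nbytes
        entry["time_s"] += dur
    return dict(summary)

summary = summarize(records)
for (fname, op), entry in sorted(summary.items()):
    bw = entry["bytes"] / entry["time_s"] / 2**20  # aggregate MB/s
    print(f"{fname} {op}: {entry['ops']} ops, "
          f"{entry['bytes'] >> 20} MB, {bw:.0f} MB/s")
```

A real analyzer additionally has to align timestamps across ranks and correlate the file-system records with framework-level events (e.g., which training step triggered which reads); the sketch only shows the aggregation step.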