Understanding I/O Behavior of Scientific Deep Learning Applications in HPC Systems

Hariharan Devarajan, Illinois Institute of Technology, hdevarajan@hawk.iit.edu
Huihuo Zheng, Argonne National Laboratory, huihuo.zheng@anl.gov
Xian-He Sun, Illinois Institute of Technology, sun@iit.edu
Venkatram Vishwanath, Argonne National Laboratory, venkatv@alcf.anl.gov

I. EXTENDED ABSTRACT

In the past decade, deep learning (DL) has been applied to a wide range of applications [1], [2], [3] to achieve unprecedented results. These include image recognition [4], natural language processing [5], and even autonomous driving [6], as well as physical science domains such as cosmology [7], materials science [8], [9], and biology [10], [11]. DL methods iteratively adjust the weights within a network to minimize a loss function. At each training step, the application reads a mini-batch of data, computes the gradient of the loss function, and then updates the weights of the network using stochastic gradient descent. Many new AI hardware platforms (e.g., GPUs, TPUs, Cerebras, etc.) have been designed and deployed to accelerate the computation during training. However, as the size and complexity of datasets grow rapidly, DL training becomes increasingly read-intensive, and I/O is a potential bottleneck in DL applications [12]. At the same time, more and more scientific DL studies are performed on high-performance supercomputers through distributed training frameworks to reduce the training time-to-solution [13]. Therefore, characterizing the I/O behavior of DL workloads in high-performance computing (HPC) environments is crucial for addressing potential I/O bottlenecks and for providing useful guidance on performing efficient parallel I/O.

In this study, we aim to understand the I/O behavior of scientific DL applications. As a starting point, we explore a collection of scientific deep learning workloads currently running at the Argonne Leadership Computing Facility (ALCF).
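As a concrete illustration, the per-step loop described above (read a mini-batch, compute the gradient of the loss, update the weights) can be sketched in plain Python for a one-parameter least-squares model. The toy dataset, learning rate, and batch size are illustrative assumptions, not taken from any of the profiled workloads.

```python
# Minimal sketch of the SGD training loop: each step reads a
# mini-batch, computes the gradient of the loss, and updates the
# weight. Toy model: y = w * x, loss = mean squared error.
data = [(x, 3.0 * x) for x in range(1, 101)]  # toy dataset, true w = 3
w = 0.0            # initial weight
lr = 1e-4          # learning rate (illustrative)
batch_size = 10

for step in range(200):
    start = (step * batch_size) % len(data)
    batch = data[start:start + batch_size]  # "read" one mini-batch
    # dL/dw for L = (1/B) * sum (w*x - y)^2 is (2/B) * sum (w*x - y) * x
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                          # stochastic gradient descent update

print(round(w, 2))  # → 3.0
```

In a real DL workload the "read one mini-batch" step is where the I/O pressure studied in this paper arises: each training step issues reads against the shared file system, so the loop is read-intensive by construction.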
These workloads are selected from various projects, such as the Argonne Data Science Program (ADSP), the Aurora Early Science Program (ESP), and the Exascale Computing Project (ECP). The science domains represented by the workloads include neutrino physics [7], cosmology [14], materials science [8], computational physics [15], and biology [16], [10]. Many of the workloads are in active development targeting the upcoming exascale supercomputers. One of the long-term goals of this study is to identify existing I/O bottlenecks in these workloads on current production machines and to suggest I/O optimizations both for current applications and as we develop them for future systems.

We profile the I/O behavior of eight DL applications on Theta, our current production leadership supercomputer at ALCF. To realize this, we utilize the profilers provided by the DL frameworks, such as the TensorFlow profiler, as well as low-level I/O profilers such as Darshan, to study the I/O behavior of applications on supercomputers. These profilers are accompanied by their own analysis tools. However, to obtain a holistic view of an application, we developed a Python library, VaniDL, for integrating and post-processing the information obtained from the profiling tools and generating a high-level I/O summary of the application.

The main contributions of this work are:
1) proposing a systematic framework for I/O profiling of DL workloads and developing an analyzer tool, VaniDL, which provides insights into the I/O behavior of DL applications;
2) a preliminary exploration of the I/O behavior of eight scientific DL applications on a leadership supercomputer.

II. I/O BEHAVIOR OF HPC DEEP LEARNING WORKLOADS

A. Methodology

Applications: We target distributed DL workloads.
These include Neutrino and Cosmic Tagging with UNet [7], Distributed Flood-Filling Networks (FFN) for shape recognition in brain tissue [8], Deep Learning Climate Segmentation [17], CosmoFlow for learning the universe at scale [14], the Cancer Distributed Learning Environment (CANDLE) for cancer research [10], Fusion Recurrent Neural Net for representation learning in plasma science [16], Learning to Hamiltonian Monte Carlo (L2HMC) [15], and the TensorFlow CNN Benchmarks [18]. These applications are implemented in TensorFlow and use Horovod for data-parallel training. Some of them also have PyTorch implementations.

Hardware: We run the applications on Theta. This supercomputer consists of more than 3600 nodes and 864 Aries routers interconnected with a dragonfly network. Each router hosts four second-generation Intel Xeon Phi processors, code-named Knights Landing (KNL). Each node is equipped with 192 GB of DDR4 and 16 GB of MCDRAM. In all the studies presented here, we use 2 hyper-threads per core for a total of 128 threads per node, and four processes per node. The datasets are stored on a Lustre file system. We set the Lustre stripe size to 1 MB and the stripe count to 48.

Tools: We use Darshan (with extended tracing) as our low-level I/O profiling tool, along with the TensorFlow profiler. Additionally, we process the profiling results using our custom analysis tool, VaniDL [19], to integrate the low-level Darshan logs with the high-level TensorFlow profiler logs and generate a high-level I/O summary of the application.
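To give a flavor of the kind of post-processing involved, the sketch below rolls hypothetical per-rank I/O trace records up into a per-file, per-operation summary. The record fields and file names are illustrative assumptions only; they are not Darshan's actual log format and this is not VaniDL's actual API.

```python
from collections import defaultdict

# Hypothetical per-rank I/O trace records in the spirit of Darshan
# extended tracing: (rank, file, operation, bytes, duration_s).
# Field layout and values are illustrative, not Darshan's real format.
records = [
    (0, "train.h5", "read", 1 << 20, 0.004),
    (0, "train.h5", "read", 1 << 20, 0.005),
    (1, "train.h5", "read", 1 << 20, 0.004),
    (1, "ckpt.h5", "write", 4 << 20, 0.020),
]

def summarize(records):
    """Aggregate low-level records into a per-(file, op) summary."""
    summary = defaultdict(lambda: {"ops": 0, "bytes": 0, "time_s": 0.0})
    for rank, fname, op, nbytes, dur in records:
        entry = summary[(fname, op)]
        entry["ops"] += 1
        entry["bytes"] += nbytes
        entry["time_s"] += dur
    return dict(summary)

summary = summarize(records)
for (fname, op), entry in sorted(summary.items()):
    bw = entry["bytes"] / entry["time_s"] / 2**20  # aggregate MB/s
    print(f"{fname} {op}: {entry['ops']} ops, "
          f"{entry['bytes'] >> 20} MB, {bw:.0f} MB/s")
```

A real analyzer additionally has to align timestamps across ranks and correlate the file-system records with framework-level events (e.g., which training step triggered which reads); the sketch only shows the aggregation step.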