Across-Stack Profiling and Characterization of Machine Learning Models on GPUs

Cheng Li∗, Abdul Dakkak∗
University of Illinois, Urbana-Champaign
{cli99,dakkak}@illinois.edu

Jinjun Xiong
IBM T. J. Watson Research Center
jinjun@us.ibm.com

Wei Wei, Lingjie Xu
Alibaba Group
{w.wei,lingjie.xu}@alibaba-inc.com

Wen-mei Hwu
University of Illinois, Urbana-Champaign
w-hwu@illinois.edu

ABSTRACT

The world has recently seen a proliferation of machine learning/deep learning (ML) models and their wide adoption across application domains. This has made the profiling and characterization of ML models an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible computing system to serve ML models with the desired latency, throughput, and energy requirements while maximizing resource utilization. Such an endeavor is challenging since the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). A thorough characterization requires understanding the behavior of the model execution across the HW/SW stack levels. Existing profiling tools are disjoint, however, and focus only on profiling within a particular level of the stack. This paper proposes a leveled profiling design that leverages existing profiling tools to perform across-stack profiling. The design does so in spite of the profiling overheads incurred from the profiling providers. We couple the profiling capability with an automatic analysis pipeline to systematically characterize 65 state-of-the-art ML models. Through this characterization, we show that our across-stack profiling solution provides insights (which are difficult to discern otherwise) on the characteristics of ML models, ML frameworks, and GPU hardware.
1 INTRODUCTION

Recently there have been numerous impressive advances from machine learning/deep learning (ML) models in solving problems in many domains such as image classification, object detection, and machine translation. This has resulted in a surge of interest in deploying these models on various hardware computing platforms/devices, including commodity servers, accelerators, reconfigurable hardware, mobile and edge devices, and ASICs. Therefore, there is an increasing need for hardware computing providers (such as cloud providers), computer architects, and system/chip designers to profile and understand these ML models across these computing platforms/devices, and to measure their accuracy, throughput, latency, and system resource utilization (memory, bandwidth, etc.). Such an endeavor is, however, greatly hampered, if not impossible, in the current ML landscape. The reasons are multi-fold:

∗The two authors contributed equally to this paper.

[Figure 1: Model-, layer-, and GPU kernel-level profiling for the MLPerf_ResNet50_v1.5 model (Table 6) on System 2 (Table 5) with batch size 256 using NVIDIA's NGC TensorFlow v19.06 container. The model layers executed are data (Data), convolution (Conv), batch normalization (BN), relu (Relu), etc. The 3 GPU kernels from the first Conv layer are highlighted along with the GPU metrics from the third kernel.]

• The number of ML models and frameworks is proliferating at an unprecedented pace because of great interest from the community. At the same time, the number of computing hardware platforms/devices targeted to run these models and frameworks has also increased.
The combinatorial possibilities among hardware, frameworks, and models have rendered the traditional manual process of understanding models unscalable.
• The ML models themselves are complicated and involve a stack of HW/SW components. An example is shown in Figure 1. At the model level, there exists an evaluation pipeline. Components at the model level include input pre-processing, model inference, and output post-processing. Stepping within the model inference, we find layer-level components, or layer nodes within the ML model, such as convolution (Conv), batch normalization (BN), and Softmax. Within each layer are the GPU kernel-level components, a series of library calls or computation kernels that are invoked by the layers. Depending on the analysis needs, different metrics may be collected at each of these hierarchy levels.
• Existing profiling tools provide only a partial view of the model's execution. For example, to characterize model latency, users insert timing code around the model inference stage of the evaluation pipeline. To capture the layer-level information, users use the ML framework's profiling capabilities [13, 25]. And, to capture GPU

arXiv:1908.06869v1 [cs.LG] 19 Aug 2019
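To illustrate the model-level timing approach described above (inserting timing code around each pipeline stage), the following Python sketch wraps the pre-processing, inference, and post-processing stages with timers. The stage functions here are hypothetical placeholders, not part of the paper's tooling, and a real setup would account for GPU asynchrony (e.g., synchronizing before reading timers):

```python
import time

def profile_model_level(pre_process, inference, post_process, raw_input, runs=10):
    """Measure the average latency of each model-level pipeline stage.

    pre_process, inference, and post_process are hypothetical stand-ins
    for the three pipeline stages of Figure 1.
    """
    timings = {"pre_process": 0.0, "inference": 0.0, "post_process": 0.0}
    for _ in range(runs):
        start = time.perf_counter()
        data = pre_process(raw_input)
        timings["pre_process"] += time.perf_counter() - start

        start = time.perf_counter()
        result = inference(data)
        timings["inference"] += time.perf_counter() - start

        start = time.perf_counter()
        post_process(result)
        timings["post_process"] += time.perf_counter() - start

    # Average over runs to smooth out measurement jitter.
    return {stage: total / runs for stage, total in timings.items()}
```

Note that this view stops at the model level: attributing time to individual layers or GPU kernels, as in Figure 1, requires the framework- and GPU-level profilers discussed next.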