TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen 1, Thierry Moreau 1, Ziheng Jiang 2, Lianmin Zheng 3, Eddie Yan 1, Meghan Cowan 1, Haichen Shen 1, Leyuan Wang 4, Yuwei Hu 5, Luis Ceze 1, Carlos Guestrin 1, Arvind Krishnamurthy 1

1 Paul G. Allen School of Computer Science & Engineering, University of Washington; 2 Fudan University; 3 Shanghai Jiao Tong University; 4 UC Davis; 5 Cornell

Abstract

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. TVM also offers automated optimization of low-level programs to hardware characteristics by employing a novel learning-based cost modeling method for rapid exploration of code optimizations. Experimental results demonstrate that TVM delivers performance across hardware back-ends that is competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends by targeting an FPGA-based generic deep learning accelerator. The system is open-sourced and in production use inside several major companies.

1 Introduction

Deep learning models can now recognize images, process natural language, and defeat humans in challenging strategy games.
There is an increasing demand to deploy smart applications to a wide spectrum of devices, ranging from cloud servers to self-driving cars and embedded devices. Mapping deep learning workloads to these devices is complicated by the diversity of hardware characteristics, including embedded CPUs, GPUs, FPGAs, and ASICs (e.g., the TPU [20]). These hardware targets diverge in terms of memory organization, compute functional units, etc., as shown in Figure 1.

Figure 1: CPU, GPU and TPU-like accelerators require different on-chip memory architectures and compute primitives. This divergence must be addressed when generating optimized code.

Current deep learning frameworks, such as TensorFlow, MXNet, Caffe, and PyTorch, rely on a computational graph intermediate representation to implement optimizations such as automatic differentiation and dynamic memory management [3, 4, 8]. Graph-level optimizations, however, are often too high-level to handle hardware back-end-specific operator-level transformations. Most of these frameworks focus on a narrow class of server-class GPU devices and delegate target-specific optimizations to highly engineered, vendor-specific operator libraries. Such operator libraries require significant manual tuning and are therefore too specialized and opaque to be easily ported across hardware devices. Providing support in various deep learning frameworks for diverse hardware back-ends in this fashion requires significant engineering effort. Even for supported back-ends, frameworks face a difficult choice: avoid graph optimizations that yield new operators absent from the predefined operator library, or use unoptimized implementations of those new operators.
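To make the fusion trade-off concrete, the following is a minimal, framework-independent sketch (plain NumPy, not TVM's actual API) of why a fused operator matters: the unfused version materializes an intermediate buffer between two operators, while the fused version computes both steps per element in a single pass.

```python
import numpy as np

def unfused(a, b):
    # Two separate operators: the intermediate result of the add is
    # written to memory before the relu reads it back.
    tmp = a + b                 # operator 1: elementwise add
    return np.maximum(tmp, 0)   # operator 2: relu

def fused(a, b):
    # One fused operator: add and relu are applied per element in a
    # single loop, so no intermediate buffer is ever materialized.
    out = np.empty_like(a)
    for i in range(a.size):
        v = a.flat[i] + b.flat[i]
        out.flat[i] = v if v > 0 else 0.0
    return out

a = np.array([[1.0, -2.0], [3.0, -4.0]])
b = np.array([[0.5, 0.5], [-5.0, 5.0]])
assert np.allclose(unfused(a, b), fused(a, b))
```

If a graph optimization produces a fused operator like this and the vendor library only ships `unfused`-style kernels, the framework must either skip the optimization or fall back to a slow implementation; generating the fused kernel automatically is the gap TVM targets.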
In order to enable both graph-level and operator-level optimizations for diverse hardware back-ends, we take a fundamentally different, end-to-end approach. We built

arXiv:1802.04799v2 [cs.LG] 20 May 2018