Nexus: A GPU Cluster Engine for Accelerating Neural Networks Based Video Analysis

Haichen Shen, University of Washington, haichen@cs.washington.edu
Lequn Chen, University of Washington, lqchen@cs.washington.edu
Yuchen Jin, University of Washington, yuchenj@cs.washington.edu
Liangyu Zhao, University of Washington, liangyu@cs.washington.edu
Bingyu Kong, Shanghai Jiao Tong University, bingyukong97@gmail.com
Matthai Philipose, Microsoft Research, matthaip@microsoft.com
Arvind Krishnamurthy, University of Washington, arvind@cs.washington.edu
Ravi Sundaram, Northeastern University, koods@ccs.neu.edu

Abstract

We address the problem of serving Deep Neural Networks (DNNs) efficiently from a cluster of GPUs. In order to realize the promise of very low-cost processing made by accelerators such as GPUs, it is essential to run them at sustained high utilization. Doing so requires cluster-scale resource management that performs detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs. Nexus is a fully implemented system that includes these innovations. On large-scale case studies on 16 GPUs, Nexus shows 1.8–12.7× better throughput than state-of-the-art systems while staying within latency constraints >99% of the time. A long-running multi-application deployment on an 88-GPU cluster violates latency SLOs on 1.3% of requests and stays within 32% of an aggressive lower bound on GPU usage.

1 Introduction

Consider a cloud-scale video analysis service that allows thousands of tenants to analyze thousands of streams each concurrently. Increasingly, the core computations for this workload are Deep Neural Networks (DNNs), which are networks of dense linear algebra computations.
Specialized hardware accelerators for DNNs, in the form of Graphics Processing Units (GPUs, which this paper focuses on) and even more specialized Tensor Processing Units (TPUs), have emerged in the recent past. GPU accelerators process DNNs orders of magnitude faster and cheaper than CPUs in many cases. However, GPUs are expensive and have very high capacity: modern devices each provide over 100 TFLOPS. Cost savings from using them depend critically on operating them at sustained high utilization. A fundamental problem therefore is to distribute the large incoming workload onto a cluster of accelerators at high accelerator utilization and acceptable latency. We address this problem in this paper.

Conceptually, this problem can be thought of as sharding inputs via a distributed frontend onto DNNs on backend GPUs. Several interacting factors complicate this viewpoint. First, given the size of GPUs, it is often necessary to place different types of networks on the same GPU. It is then important to select and schedule them so as to maximize their combined throughput while satisfying latency bounds. Second, many applications consist of groups of DNNs that feed into each other. It is important to be able to specify these groups, and to schedule the execution of the entire group on the cluster so as to maximize performance. Third, it is well known that dense linear algebra computations such as DNNs execute much more efficiently when their inputs are batched together. Batching fundamentally complicates scheduling and routing because (a) it benefits from cross-tenant and cross-request coordination and (b) it forces the underlying bin-packing-based scheduling algorithms to incorporate batch size. Fourth, the increasingly common use of transfer learning in today's workloads has led to specialization of networks, where two tasks that formerly used identical networks now use networks that are only mostly identical.
Since conventional DNN execution systems can batch inputs only when they are applied to the same model, such specialized networks lose the benefits of batching.

Nexus is a GPU cluster engine for DNN execution that addresses these problems to attain high execution throughput under latency Service Level Objectives (SLOs). It uses three main techniques to do so. First, it relies on a novel batching-aware scheduler (Section 6.1) that performs bin packing when the balls being packed into bins have variable size, depending on the size of the batch they are in. This schedule specifies the GPUs needed, the distribution of DNNs across them and the