Efficient Multi-Tenant Inference on Video using Microclassifiers
Giulio Zhou, Thomas Kim, Christopher Canel, Conglong Li, Hyeontaek Lim, David G. Andersen, Michael Kaminsky, Subramanya R. Dulloor
Carnegie Mellon University; Intel Labs

1 INTRODUCTION
This paper addresses a growing challenge in processing video: the scaling challenge presented by the combination of an increasing number of video sources (cameras) and an increasing number of heavyweight DNN-based applications (which we term "queries") to be run on each source. As a running example, we draw from an environmental and traffic monitoring deployment at CMU, one feed from which is depicted at right. This feed supports applications such as car and pedestrian counting, open parking spot detection, train detection (in support of an environmental monitoring research project attempting to determine locomotive emissions), and observing whether building lights are left on. These cameras are deployed using a mix of the high-speed campus network and a lower-speed/higher-cost cable modem deployment on power poles in the area.

Cost constraints motivate us to be parsimonious with bandwidth on the wide-area deployment (and planned future wirelessly-connected nodes) and, in general, with our CPU/GPU processing budgets. The high cost of truck rolls to install and upgrade nodes motivates us to put multiple state-of-the-art 4K vision cameras on the nodes to flexibly support future applications, but we lack the bandwidth to backhaul the full feeds. At the same time, both current and future applications may wish to run one or more state-of-the-art DNNs to perform image classification [6, 17], object detection [7, 13, 14], and video understanding [3, 19]. In this paper, we assume that cameras are fixed (e.g., traffic monitoring).
Most applications are interested in possibly-overlapping subsequences of frames (e.g., frames containing cars, or trains, or with people moving), and can express a notion of that importance using a query. Frames that match a query are sent back to the datacenter for further processing. Each application defines its own queries and submits them to the edge node. Our system is responsible for efficiently executing a multitude of queries at the edge node with low false positives (to avoid wasting resources) and low false negatives (to preserve application fidelity).

While an abstract query could be a black-box DNN that takes a frame as input and outputs a binary forward/discard decision, such an approach scales poorly as the number of queries executed on each stream grows. Instead, we develop an idea called microclassifiers, which are small classifiers taking as input a subset of the activations of a known, standard convolutional neural network (such as MobileNet [5]). Each query is represented using a unique microclassifier, which specifies both its own internal DNN structure and which (small) subset of the reference CNN activations it accepts as input. Microclassifiers enable an edge node to serve tens of queries or more with high accuracy by amortizing the cost of the CNN activations, adding only a small cost per query.

Dataset. We evaluate the microclassifier approach on a novel dataset containing 83.5 hours of footage from a camera overlooking train tracks, sampled at 1 fps, for a total of 290,758 frames, as shown in Figure 1.

Figure 1: Regions A and B, corresponding to the Train and Car datasets respectively.

We use the first 100,000 frames as a training set and the remainder as the test set. We created two sets of labeled data by annotating frames in which (a) a train appears in Region A (the Train dataset); and (b) a car appears in Region B (the Car dataset).
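To make the microclassifier idea concrete, the following is a minimal sketch of a per-query head that reads only a spatial window and channel subset of a shared backbone feature map. All shapes, region choices, and layer sizes here are illustrative assumptions, not the paper's actual configuration, and a plain numpy MLP stands in for the query's internal DNN.

```python
import numpy as np

rng = np.random.default_rng(0)

class Microclassifier:
    """A tiny per-query head that reads only a subset of a shared
    backbone feature map: a spatial window plus a channel subset.
    (Illustrative sketch; the region and layer sizes are hypothetical.)"""

    def __init__(self, rows, cols, channels, hidden=16):
        self.rows, self.cols, self.channels = rows, cols, channels
        n_in = len(rows) * len(cols) * len(channels)
        self.w1 = rng.normal(0, 0.1, (n_in, hidden))
        self.w2 = rng.normal(0, 0.1, (hidden,))

    def __call__(self, feature_map):
        # feature_map: (H, W, C) activations from the shared CNN.
        region = feature_map[np.ix_(self.rows, self.cols, self.channels)]
        h = np.maximum(region.reshape(-1) @ self.w1, 0.0)  # one-hidden-layer ReLU MLP
        logit = float(h @ self.w2)
        return 1.0 / (1.0 + np.exp(-logit))  # P(frame matches query)

# A shared 14x14x256 feature map (e.g., a mid-level MobileNet layer).
fmap = rng.normal(size=(14, 14, 256))

# A hypothetical "train in Region A" query watching part of the frame.
train_query = Microclassifier(rows=range(0, 4), cols=range(0, 6),
                              channels=range(0, 32))
score = train_query(fmap)  # forward the frame iff score exceeds a threshold
```

Because the head sees only its 4x6x32 slice of the activations, each additional query adds a few thousand multiply-adds per frame, versus millions for a full CNN pass.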
There are 2,777 frames that contain images of 21 different trains, and 16,070 frames that contain images of 1,296 different vehicles. Our goal is for this dataset to be a simple yet representative example of a typical traffic monitoring workload.

2 SCALABLE VIDEO QUERIES AT THE EDGE
Transfer Learning Doesn't Work Well. In our preliminary evaluations, traditional transfer learning (fine-tuning the last n layers of a CNN) yielded poor results (Table 2). There are two contributing factors. First, the frames (and by extension, the features) from a fixed-view camera are generally quite similar, which makes it easy for a fine-tuned MobileNet to overfit (as evidenced by fast training convergence and poor test results). Second, because many events in this setting occur in a spatially constrained part of the frame, globally pooled features have weak discriminative ability. Cropping the image to cover only the spatially relevant portion is neither scalable nor sufficiently general because cropping (a) cannot handle non-rectangular inputs, (b) distorts non-square inputs, and (c) requires the network to be run separately for each image crop.

Use Shared CNN Feature Maps Instead. CNNs trained on image classification tasks produce nonlinear hierarchical features that offer a trade-off between spatial localization and semantic information. Selective processing of these features has been used successfully in tasks such as object region proposals, segmentation, and tracking [2, 4, 10, 14], as well as video action classification [16]. Our approach uses a single pretrained CNN that runs on every frame in a multi-task fashion. We allow all of our query models
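The amortization argument above can be sketched as a per-frame dispatch loop: the shared backbone runs once, and each query head consumes only its own slice of the resulting feature map. This is a sketch under stated assumptions; the `backbone` stub, query names, region slices, and head sizes are all hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def backbone(frame):
    # Stand-in for the shared pretrained CNN (e.g., MobileNet):
    # one expensive pass per frame producing a (H, W, C) feature map.
    return rng.normal(size=(14, 14, 256))

def make_head(n_in, hidden=16):
    # Tiny per-query MLP head (hypothetical sizes).
    w1 = rng.normal(0, 0.1, (n_in, hidden))
    w2 = rng.normal(0, 0.1, (hidden,))
    def head(x):
        return float(np.maximum(x @ w1, 0.0) @ w2)
    return head

# Each query = (spatial/channel slice of the shared features, tiny head).
queries = {
    "train_region_a": ((slice(0, 4), slice(0, 6), slice(0, 32)),
                       make_head(4 * 6 * 32)),
    "car_region_b":   ((slice(8, 14), slice(4, 10), slice(0, 32)),
                       make_head(6 * 6 * 32)),
}

def process(frame, threshold=0.0):
    fmap = backbone(frame)                      # backbone cost paid once
    decisions = {}
    for name, (region, head) in queries.items():
        logit = head(fmap[region].reshape(-1))  # small per-query cost
        decisions[name] = logit > threshold     # forward vs. discard
    return decisions
```

Adding a query to this loop touches only the dictionary, so per-frame cost grows by one small head rather than one full CNN, which is what lets an edge node serve tens of queries on a single stream.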