Adaptive Streaming Perception using Deep Reinforcement Learning

Anurag Ghosh, Akshay Nambi, Aditya Singh, Harish YVS, Tanuja Ganu
Microsoft Research India
t-angh, akshayn, t-adsingh, t-harishyvs, taganu@microsoft.com

Abstract

Executing computer vision models on streaming visual data, or streaming perception, is an emerging problem with applications in self-driving, embodied agents, and augmented/virtual reality. The development of such systems is largely governed by the accuracy and latency of the processing pipeline. While past work has proposed numerous approximate execution frameworks, their decision functions focus solely on optimizing latency, accuracy, or energy in isolation. This results in sub-optimum decisions, affecting overall system performance. We argue that streaming perception systems should holistically maximize overall system performance, i.e., consider both accuracy and latency simultaneously. To this end, we describe a new approach based on deep reinforcement learning that learns these tradeoffs at runtime for streaming perception. We formulate this tradeoff optimization as a novel deep contextual bandit problem and design a new reward function that holistically integrates latency and accuracy into a single metric. We show that our agent can learn a competitive policy across multiple decision dimensions, outperforming state-of-the-art policies on public datasets.

1 Introduction

An increasing number of scenarios rely on executing computer vision tasks, viz., classification, detection, or segmentation, on streaming visual data (or streaming perception) [4, 40, 20]. For practical applications, such as self-driving vehicles or augmented and virtual reality (AR/VR), it is critical for the processing pipeline to be both fast and accurate, thus maximizing overall streaming performance in terms of both accuracy and latency.
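The contextual bandit framing mentioned in the abstract can be illustrated with a minimal epsilon-greedy sketch. Everything here is hypothetical for illustration: the class name, the per-arm linear reward model, and the reward values are assumptions, not the paper's actual deep agent; in the paper's setting each arm would be an execution configuration, the context would be frame features, and the reward would combine accuracy and latency.

```python
import numpy as np

class EpsilonGreedyContextualBandit:
    """Illustrative epsilon-greedy contextual bandit (hypothetical sketch).

    Each arm stands in for one execution configuration; a linear model per
    arm estimates reward from context (stand-in frame features).
    """

    def __init__(self, n_arms, n_features, epsilon=0.1, lr=0.05, seed=0):
        self.epsilon = epsilon
        self.lr = lr
        self.rng = np.random.default_rng(seed)
        # One weight vector per arm: predicted reward = w_arm . context
        self.weights = np.zeros((n_arms, n_features))

    def select_arm(self, context):
        if self.rng.random() < self.epsilon:               # explore
            return int(self.rng.integers(len(self.weights)))
        return int(np.argmax(self.weights @ context))      # exploit

    def update(self, arm, context, reward):
        # One SGD step on squared error between predicted and observed reward.
        pred = self.weights[arm] @ context
        self.weights[arm] += self.lr * (reward - pred) * context

# Toy run: two hypothetical configurations; arm 1 yields higher net reward
# (higher accuracy outweighing its extra latency penalty).
bandit = EpsilonGreedyContextualBandit(n_arms=2, n_features=2)
for _ in range(2000):
    ctx = bandit.rng.random(2)                 # stand-in frame features
    arm = bandit.select_arm(ctx)
    reward = (0.9 if arm == 1 else 0.5) - 0.1 * arm   # accuracy - latency cost
    bandit.update(arm, ctx, reward)
```

After training, greedy selection settles on the higher-reward configuration; a deep variant would replace the linear per-arm models with a network over richer frame features.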
Past work has explored numerous latency-sensitive approaches [17, 16, 35] and approximate execution frameworks [13, 30, 40] that adhere to strict latency requirements, resource constraints, or an accuracy target, resulting in sub-optimum decisions. These approaches have three critical drawbacks for streaming perception tasks:

(D1) The cost of profiling the different tradeoffs explodes combinatorially as the number of choices increases. For example, in streaming detection, if the decision function has to choose the appropriate input resolution {360, 480, 560, 640, 720}, number of proposals {100, 300, 500, 1000}, tracker resolution {360, 480, 560, 640, 720}, and stride {3, 5, 10, 15, 30}, then the design space has 500 combinations, from which selecting the right one is non-trivial (See Section 3).

(D2) The decision function does not consider content characteristics of the streaming data at run time. Content-aware approaches [39, 4, 38] adapt the configurations or switch between models based on content characteristics, e.g., complex frames are passed to expensive deeper/wider models. However, classifying a frame as "simple" or "complex" is challenging, and identifying the right metrics to extract these characteristics is non-trivial (See Section 4.1).

(D3) The decision function is not linked to the overall accuracy and latency performance of the system. Current execution pipelines do not jointly optimize accuracy and latency for online real-time tasks. Streaming perception is significantly more challenging than offline perception [20], necessitating the design of new reward functions that accommodate both accuracy and latency (See Section 4.2).

Preprint. arXiv:2106.05665v1 [cs.CV] 10 Jun 2021
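The combinatorial blow-up in (D1) is easy to verify by enumerating the choice sets listed above (the sets below are copied verbatim from the text; only the variable names are ours):

```python
from itertools import product

# Choice sets for streaming detection, as listed in (D1).
input_resolutions   = [360, 480, 560, 640, 720]
num_proposals       = [100, 300, 500, 1000]
tracker_resolutions = [360, 480, 560, 640, 720]
strides             = [3, 5, 10, 15, 30]

design_space = list(product(input_resolutions, num_proposals,
                            tracker_resolutions, strides))
print(len(design_space))  # 5 * 4 * 5 * 5 = 500 configurations
```

Exhaustively profiling each configuration for accuracy and latency is what becomes prohibitive, and every additional decision dimension multiplies this count again.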