An Online System for Real-Time Activity Detection in Untrimmed Surveillance Videos

Aayush Jung Rana (aayushjr@knights.ucf.edu)
Praveen Tirupattur (praveentirupattur@knights.ucf.edu)
Mamshad Nayeem Rizve (nayeemrizve@knights.ucf.edu)
Kevin Duarte (kevin_duarte@knights.ucf.edu)
Ugur Demir (ugur@knights.ucf.edu)
Yogesh Rawat (yogesh@crcv.ucf.edu)
Mubarak Shah (shah@crcv.ucf.edu)

Abstract

Activity detection in surveillance videos is a challenging problem due to multiple factors such as a large field of view, the presence of multiple activities, varying scales and viewpoints, and the untrimmed nature of the videos. The requirement to process surveillance videos in real time makes the task even more challenging. In this work, we propose a real-time online system to perform activity detection on untrimmed surveillance videos. The proposed system consists of three stages: first we detect tubelets containing activities, then classify them, and finally merge them to generate spatio-temporal activity detections. We propose a localization network which takes a video clip as input and makes use of a feature pyramid, a multi-layer loss, and atrous convolutions to address the issue of multiple scales and to detect small activities as tubelets. Processing videos online at the clip level drastically reduces the computation time for detecting activities. The detected tubelets are assigned activity class scores and merged together using our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form action tubes. The TMAS algorithm efficiently connects the tubelets in an online fashion to generate spatio-temporal detections that are robust to activities of varying length. We perform our experiments on the DIVA (Deep Intermodal Video Analytics) dataset and demonstrate the effectiveness of the proposed approach in terms of speed (∼100 fps) and performance, with state-of-the-art results. The code and models will be made publicly available.
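To make the final stage of the pipeline concrete, the following is a minimal sketch of how tubelets could be greedily linked into action tubes in an online fashion, in the spirit of the TMAS merging step. The function names, the tubelet representation (start frame, end frame, bounding box), and the IoU threshold are our own illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of online tubelet merging (TMAS-style linking).
# A tubelet is (start_frame, end_frame, (x1, y1, x2, y2)); the data
# layout and threshold are assumptions for illustration only.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_tubelets(tubelets, iou_thresh=0.5):
    """Greedily append each incoming tubelet to an open action tube
    when it is temporally contiguous with the tube's last tubelet and
    spatially overlaps it; otherwise start a new tube."""
    tubes = []
    for t in sorted(tubelets, key=lambda t: t[0]):
        for tube in tubes:
            last = tube[-1]
            if t[0] <= last[1] + 1 and iou(t[2], last[2]) >= iou_thresh:
                tube.append(t)
                break
        else:
            tubes.append([t])  # no match: open a new action tube
    return tubes
```

For example, two temporally adjacent, spatially overlapping tubelets would be linked into one tube, while a tubelet far away in time or space would start a second tube; a split step would then divide tubes at points where the activity class changes.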
1. Introduction

Figure 1. Top: Two sample frames from different scenes of the DIVA dataset showing variation in perspective, scale, and field of view. Bottom: Sample frames from the AVA dataset [10] (left) and from the THUMOS'14 dataset [13] (right). The DIVA dataset contains a greater number of concurrent actions as well as a greater variety of action scales (both spatially and temporally).

Deep convolutional neural networks have achieved impressive action classification results in recent years [27, 4, 28]. Similar advancements have been made for the tasks of action detection in trimmed videos [14, 26, 7] and temporal action localization in untrimmed videos [32, 21]. However, these improvements have not transferred to spatio-temporal action detection in untrimmed videos; current computer vision systems have yet to achieve high performance on this difficult task.

Action detection in untrimmed security videos poses multiple challenges. Surveillance videos comprise multiple viewpoints and contain several actors performing multiple actions concurrently. These actors appear at varying scales and tend to be extremely small relative to the video frame, which makes the detection of small activities particularly challenging. These challenges make it difficult to extend existing methods to detect actions in the untrimmed security videos found in the DIVA (Deep Intermodal Video Analyt-