UCF-System: Activity Detection in Untrimmed Videos

Ishan Dave*, Zacchaeus Scheffer*, Praveen Tirupattur*, Yogesh Rawat†, Mubarak Shah†
Center for Research in Computer Vision, University of Central Florida, Orlando, Florida
*{ishandave,zaccy,praveentirupattur}@knights.ucf.edu, †{yogesh, shah}@crcv.ucf.edu

Abstract

Activity detection in surveillance videos is a challenging problem due to multiple factors such as a large field of view, the presence of multiple activities, varying scales and viewpoints, and the untrimmed nature of the videos. The requirement of processing surveillance videos in real time makes this even more challenging. In this work, we propose a real-time online system to perform activity detection on untrimmed surveillance videos. The proposed system consists of three stages: first we detect tubelets containing activities, then classify them, and finally merge them to generate spatio-temporal activity detections. We propose a localization network which takes a video clip as input and makes use of a feature pyramid, a multi-layer loss, and atrous convolutions to address the issue of multiple scales and to detect small activities in the form of tubelets. Online processing of videos at the clip level drastically reduces the computation time for detecting activities. The detected tubelets are assigned activity class scores and merged using our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form action tubes. The TMAS algorithm efficiently connects the tubelets in an online fashion to generate spatio-temporal detections which are robust to activities of varying length. We perform our experiments on the DIVA (Deep Intermodal Video Analytics) dataset and demonstrate the effectiveness of the proposed approach in terms of speed (∼100 fps) and performance, with state-of-the-art results. The code and models will be made publicly available.
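To make the three-stage pipeline concrete, the following is a minimal sketch of the clip-level tubelet representation and a greedy tubelet-linking step in the spirit of the merge stage. The `Tubelet` fields, the IoU threshold, and the adjacency-plus-overlap linking rule are illustrative assumptions for exposition, not the paper's actual TMAS implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical data structure: one tubelet spans the frames of a single
# clip and carries a box per frame plus the class assigned in stage 2.
@dataclass
class Tubelet:
    start: int                        # first frame index of the clip
    end: int                          # last frame index of the clip
    boxes: List[Tuple[int, int, int, int]]  # (x1, y1, x2, y2) per frame
    label: str = ""                   # activity class from stage 2
    score: float = 0.0


def iou(a, b):
    """Spatial IoU between two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def merge_tubelets(tubelets, iou_thresh=0.5):
    """Toy stand-in for the merge stage: greedily append a clip's tubelet
    to an open action tube when the two are temporally adjacent, share the
    same class label, and their boundary boxes overlap spatially."""
    tubes = []
    for t in sorted(tubelets, key=lambda t: t.start):
        for tube in tubes:
            if (tube.label == t.label and tube.end + 1 == t.start
                    and iou(tube.boxes[-1], t.boxes[0]) >= iou_thresh):
                tube.end = t.end          # extend the tube temporally
                tube.boxes += t.boxes     # append the new clip's boxes
                break
        else:
            # No compatible open tube: this tubelet starts a new tube.
            tubes.append(Tubelet(t.start, t.end, list(t.boxes),
                                 t.label, t.score))
    return tubes
```

Because each clip's tubelets are linked to already-open tubes as they arrive, this style of merging runs online, which is what allows the full system to process untrimmed streams without buffering the whole video.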
1 Introduction

Deep convolutional neural networks have achieved impressive action classification results in recent years [25, 2, 26]. Similar advancements have been made for the tasks of action detection in trimmed videos [12, 24, 5] and temporal action localization in untrimmed videos [29, 19]. However, these improvements have not transferred to spatio-temporal action detection in untrimmed videos; current computer vision systems have yet to achieve high performance on this difficult task.

Action detection in untrimmed security videos poses multiple challenges. Surveillance videos comprise multiple viewpoints and contain several actors performing multiple actions concurrently. These actors appear at varying scales and tend to be extremely small relative to the video frame, which makes detecting small activities especially difficult. These challenges make it hard to extend existing methods to the untrimmed security videos found in the DIVA (Deep Intermodal Video Analytics) dataset [17]. Current methods are trained and evaluated on datasets which contain some, but not all, of these challenges. For example, THUMOS'14 [11] consists of untrimmed videos, but each video contains only one or two actors performing the same action. The AVA dataset [8] contains multiple actors and actions, but each video is trimmed. Figure 1 shows sample frames from the DIVA dataset and compares them with frames from other action detection datasets.