1520-9210 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2019.2959426, IEEE Transactions on Multimedia

Spatio-Temporal VLAD Encoding of Visual Events using Temporal Ordering of the Mid-Level Deep Semantics

Mohammad Soltanian, Sajjad Amini, Shahrokh Ghaemmaghami, Senior Member, IEEE

Abstract—Classification of video events based on frame-level descriptors is a common approach to video recognition, and proper encoding of these descriptors is vital to the whole event classification procedure. While several efficient video descriptor encoding methods exist, they often ignore the temporal ordering of the descriptors. In this paper, we show that taking into account the temporal inter-frame dependencies and tracking the chronological order of video sub-events further improves the accuracy of event recognition. First, the frame-level descriptors are extracted using convolutional neural networks (CNNs) pre-trained on ImageNet and fine-tuned on a portion of the training video frames. Then, a spatio-temporal encoding is applied to the derived descriptors. The proposed spatio-temporal encoding, the main contribution of this work, is inspired by the well-known vector of locally aggregated descriptors (VLAD) encoding in the spatial domain and by total variation de-noising (TVD) in the temporal domain. The proposed unified spatio-temporal encoding is then shown to take the form of a convex optimization problem, which is solved efficiently with the alternating direction method of multipliers (ADMM) algorithm.
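The temporal TVD component just mentioned can be illustrated with a minimal one-dimensional sketch. It assumes the standard TV-denoising objective min_x ½‖x − y‖² + λ‖Dx‖₁, where D is the first-difference operator, split via z = Dx and solved with ADMM; the function names, parameter values (λ, ρ, iteration count), and dense solve below are illustrative choices, not the paper's implementation:

```python
import numpy as np

def soft_threshold(v, k):
    """Elementwise soft-thresholding, the proximal operator of k*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)

def tvd_admm(y, lam, rho=1.0, n_iter=100):
    """1-D total variation de-noising via ADMM.

    Solves min_x 0.5*||x - y||^2 + lam*||D x||_1 with the splitting
    z = D x, where D is the (n-1) x n first-difference operator.
    """
    n = len(y)
    D = np.diff(np.eye(n), axis=0)        # first-difference operator
    A = np.eye(n) + rho * D.T @ D         # fixed x-update system matrix
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)                   # scaled dual variable
    for _ in range(n_iter):
        # x-update: quadratic subproblem (I + rho*D'D) x = y + rho*D'(z - u)
        x = np.linalg.solve(A, y + rho * D.T @ (z - u))
        # z-update: soft-thresholding of the differences
        z = soft_threshold(D @ x + u, lam / rho)
        # dual ascent
        u = u + D @ x - z
    return x
```

Applied to a noisy piecewise-constant sequence, the soft-thresholding z-update drives most consecutive differences to exactly zero, which is the mechanism that encourages temporally smooth, piecewise-constant trajectories in the encoding.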
The experimental results show the superiority of the proposed encoding method, in terms of recognition accuracy, over both frame-level video encoding approaches and spatio-temporal video representations. Compared to the state-of-the-art approaches, our encoding method improves the mean average precision (mAP) on the Columbia consumer video (CCV), unstructured social activity attribute (USAA), YouTube-8M, and Kinetics datasets, and is very competitive on the FCVID dataset.

Index Terms—Convolutional neural network, Columbia Consumer Video (CCV), Unstructured Social Activity Attribute (USAA), FCVID, YouTube-8M, Kinetics, vector of locally aggregated descriptors, alternating direction method of multipliers, projected gradient descent, support vector machine.

M. Soltanian, S. Amini, and S. Ghaemmaghami are with the Electrical Engineering Department and Electronics Research Institute, Sharif University of Technology, Tehran, Iran (e-mail: soltanian m@ee.sharif.edu, amini s@ee.sharif.edu, ghaemmagm@sharif.edu). Manuscript received Date; revised Date

I. Introduction

RECOGNITION of video events has attracted much interest in the computer vision community in recent years. High-level visual event recognition is the process in which video clips containing events of interest are automatically identified. The events often comprise high levels of content complexity, meaning that they encompass both short-term and long-term spatial and temporal interactions under diverse environmental settings. Video event recognition can be used in a wide range of applications, e.g., management of personal video collections, extensive video search on the web, intelligent advertising, video indexing, content-based video retrieval, video browsing, video summarization, smart surveillance, and enhancement of human-computer interaction [1], [2].
For instance, with the increasing number of digital cameras and camera-equipped hand-held smartphones, there is a growing need for indexing and retrieval within huge collections of unconstrained videos. This is one of the areas where video event recognition plays a vital role in meeting consumers' growing expectations for content-based video processing [3]. As another example, with the huge growth of video content, automatic video summarization has become a key requirement to help people browse videos of interest. Video event recognition is also an important element of a summarization system, allowing it to compactly represent the synopsis of the original video without loss of any important visual events [4].

Video event recognition is a more challenging task than similar tasks such as activity recognition. The main difficulties arise from the vast intra-class variation of events [5], the presence of strong pre-processing noise [6], the complexity of video event structures [7], and the variability of video durations [8]. As a matter of fact, the same video event may occur in situations with quite different backgrounds, people, objects, actions, etc.

Appearance features [9], [10] and motion features [11], [12] are the most important visual attributes employed in the video event recognition task. Likewise, acoustic features [13], [14] are important non-visual attributes for video recognition. Improved dense trajectories (IDT) [11], comprising both motion and appearance descriptors [15], have been shown to be among the best hand-designed features for video event recognition [11]. Their performance is superior to that of previously introduced efficient feature generation methods [5], [16] such as scale invariant feature transform (SIFT) [17] and space-time interest points (STIP) [18]. However, using IDT in practical scenarios is strictly limited by its huge computational complexity, especially when computational resources are restricted [5].
In fact, other spatio-temporal feature descriptors [19] like HOG3D [20], 3DSIFT [21], Motion Boundary Histogram (MBH) [22]