ACTION RECOGNITION USING INTEREST POINTS CAPTURING DIFFERENTIAL MOTION INFORMATION

Gaurav Kumar Yadav, Prakhar Shukla, Amit Sethi

Department of Electronics and Electrical Engineering, IIT Guwahati
Department of Mathematics, IIT Guwahati, Guwahati

ABSTRACT

Human action recognition has been a challenging task in computer vision because of intra-class variability. State-of-the-art methods have shown good performance for constrained videos but have failed to achieve good results for complex scenes. Reasons for their failure include treating the spatial and temporal dimensions without distinction, as well as not capturing temporal information during feature extraction or video representation. To address these problems, we propose principled changes to an action recognition framework based on video interest point (IP) detection, with capturing differential motion as the central theme. First, we propose to detect points with high curl of optical flow, which captures relative motion boundaries in a frame. We track these points to form dense trajectories. Second, we discard points on the trajectories that do not represent a change in motion of the same object, yielding temporally localized IPs. Third, we propose a video representation based on the spatio-temporal arrangement of IPs with respect to their neighboring IPs. The proposed approach yields a compact and information-dense representation without using any local descriptor around the detected IPs. It significantly outperforms state-of-the-art methods on the UCF YouTube dataset, which has complex action classes, as well as on the KTH dataset, which has simple action classes.

Index Terms— Video interest points, action recognition, dense trajectories, optical flow

1. INTRODUCTION

Human action recognition has been studied very extensively due to its potential applications in video surveillance, search, and retrieval.
Defining and recognizing a class of actions is fraught with problems such as large variations in the motion, posture, and clothing of people, as well as variations in scene illumination and background. A widely applicable solution to deal with all these variations is yet to come. Until the advent of standardized action datasets such as Weizmann [1] and KTH [2], comparing techniques was not easy. However, most datasets and techniques were still based on simplifying assumptions such as uncluttered backgrounds, isolated actions, and a static camera until more complex datasets such as UCF11 [3] and techniques such as [4] came about. We compare our methods on both simple [2] and complex [3] datasets.

Most methods for human action recognition can be classified broadly into two categories [5]: hierarchical and single-layered approaches. Hierarchical approaches break a complex activity into simple activities or sub-events. Multiple layers of sub-events are constructed for the analysis of complex activities. Such methods, however, are more complex, and recognition of high-level activities runs into problems if the sub-events or low-level activities are not reliably recognized. On the other hand, single-layered approaches tend to be faster and more suitable for real-time applications because they recognize actions directly from the video. They are further classified into two categories: space-time approaches and sequential approaches. Space-time approaches tend to be the fastest because, unlike sequential approaches, they do not consider temporal order; they are further divided into three categories: those based on space-time volumes, trajectories, and interest points. Our method is based on trajectories and interest points, while it also partly incorporates elements of sequential approaches to combine the advantages of these three groups of techniques.
Many action recognition methods have been proposed based on interest points (IPs), such as [6], [7], and [8], which are characterized by their detectors, descriptors, and fusion methods. Although many interest point detectors have been proposed for videos, such as STIP [9], selective-STIP [4], Cuboid [10], n-SIFT [11], Mo-SIFT [12], and curl of optical flow (COF) [13], all methods except Mo-SIFT and COF treat the temporal dimension in a manner similar to the two spatial dimensions, thus extending 2-D spatial interest point detectors to 3-D. This is not appropriate, as shown in [13], because of unique properties of the temporal dimension such as object persistence and smoothness. We extend this approach significantly based on three contributions that capture our proposed theme that differential motion carries important information about an action.

Our first contribution is to use and extend the interest point detector proposed in [13], which was based on unique properties of the time axis and captured points on relative motion boundaries. The threshold applied to the curl of optical flow was fixed in [13], which led to large variations in
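The core of the detector in [13] is to compute the curl of the per-frame optical-flow field and keep pixels where its magnitude exceeds a fixed threshold, since high curl marks relative motion boundaries. A minimal NumPy sketch of that idea follows; the function names, the finite-difference curl, and the threshold value are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def curl_of_flow(flow):
    """Curl of a 2-D optical-flow field.

    flow: array of shape (H, W, 2) holding (vx, vy) per pixel.
    Returns an (H, W) array of curl values: d(vy)/dx - d(vx)/dy.
    """
    vx, vy = flow[..., 0], flow[..., 1]
    dvy_dx = np.gradient(vy, axis=1)  # derivative along x (columns)
    dvx_dy = np.gradient(vx, axis=0)  # derivative along y (rows)
    return dvy_dx - dvx_dy

def detect_interest_points(flow, threshold=1.0):
    """Pixels whose absolute curl exceeds a fixed threshold, as in [13]."""
    c = curl_of_flow(flow)
    ys, xs = np.nonzero(np.abs(c) > threshold)
    return list(zip(xs.tolist(), ys.tolist()))
```

In practice the flow field would come from a dense optical-flow estimator run on consecutive frames; a purely rotational flow is a useful sanity check, since a rigid rotation at angular rate w has constant curl 2w everywhere.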