Multiple Scale-specific Representations for Improved Human Action Recognition

Amir H. Shabani (a,b), John S. Zelek (b), David A. Clausi (a)
(a) Vision and Image Processing (VIP) Lab, (b) Intelligent Systems Lab
Department of Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada, N2L 3G1
E-mails: hshabani, jzelek, dclausi@uwaterloo.ca

Abstract

Human action recognition in video is important in many computer vision applications such as automated surveillance. Human actions can be compactly encoded using a sparse set of local spatio-temporal salient features at different scales. Existing bottom-up methods construct a single dictionary of action primitives from the joint features of all scales and hence a single action representation. This representation cannot fully exploit the complementary characteristics of the motions across different scales. To address this problem, we introduce the concept of learning multiple dictionaries of action primitives at different resolutions and, consequently, multiple scale-specific representations for a given video sample. Using a decoupled fusion of multiple representations, we improved the human action classification accuracy on realistic benchmark databases by about 5% compared with state-of-the-art methods.

Keywords: Human action recognition, scale-specific representation, concatenated representation, decoupled representation, spatio-temporal salient features, separability test.

1. Introduction

Humans can easily detect and recognize the type of actions performed in a video. However, the automatic recognition of human actions [1, 2, 3, 4] is a challenge in computer vision, with growing applications in automated surveillance [5], content-based video retrieval [6], video summarization [7], elderly home monitoring for assisted living [8], and human-computer interaction [5]. Confusion arises because people perform the same action in noticeably different ways, leading to errors of omission.
Also, individuals performing different actions that visually appear similar lead to errors of commission. In addition, illumination and view/scale changes create further challenges for automatically interpreting the scene.

Discriminative bottom-up approaches have become more popular for human action recognition in unconstrained settings such as YouTube videos. A widely used approach is the bag-of-words (BOW) framework [1, 2, 4, 9] (Fig. 1(a)), in which the video contents are sparsely localized by salient changes such as the starts/stops of subactions. In this framework, the salient features are first extracted at multiple spatial and temporal scales. A single dictionary of action primitives (i.e., visual words) is then learnt from the joint features of all scales from the training video samples. Conventionally, an action is represented by a normalized histogram which shows the frequency of the multi-scale features over the action primitives. Finally, a support vector machine (SVM) classifier with a matching kernel such as linear, χ2, or (Gaussian) radial basis function [4, 9, 10, 11] categorizes an unknown action representation according to its distance from the decision boundaries learnt during training.

There are three main elements in a BOW framework which directly affect the final action classification accuracy: (1) the quality of the salient features which capture the local video events, (2) the descriptiveness of the dictionary of action primitives and, consequently, the discriminativeness of the action representations, and (3) the matching strategy and type of classifier.
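As a concrete illustration, the conventional single-dictionary BOW encoding described above might be sketched as follows. This is a minimal sketch, not the authors' exact pipeline: the descriptors are random placeholders standing in for real spatio-temporal features, the dictionary size is arbitrary, and the SVM stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    # Learn a dictionary of k "action primitives" (visual words) from
    # local descriptors X (n_descriptors x dim) with plain Lloyd k-means.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers

def bow_histogram(X, centers):
    # Quantize each descriptor to its nearest visual word and build the
    # normalized frequency histogram that represents the whole video.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(1)
    hist = np.bincount(labels, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Hypothetical multi-scale descriptors pooled from all training videos;
# a single dictionary is learnt from the joint features of all scales.
train_desc = rng.normal(size=(500, 32))
dictionary = kmeans(train_desc, k=50)

# One video -> one histogram over the shared dictionary; this vector
# would then be fed to an SVM with a matching kernel.
video_desc = rng.normal(size=(80, 32))
h = bow_histogram(video_desc, dictionary)
```

Note that because every scale is quantized against the same dictionary, the intrinsic scale of each feature is lost in `h`, which is exactly the limitation the paper targets.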
Different methods use different-quality features with different classifiers [1, 2, 4, 9, 11], but most of these methods use a single dictionary of action primitives and a single action representation, which cannot fully exploit the complementary characteristics of the motions at different scales; hence, this single dictionary is not sufficiently robust to represent all the different motion patterns accurately. Moreover, the intrinsic scale at which the salient features are extracted is discriminative information that cannot be encoded in the single action representation.

This paper proposes two alternatives to the single non-scale-specific dictionary learning and hence to the single action representation, in order to improve the discrimination of different actions and consequently boost the classification accuracy. To address the limitations of a single action representation, we propose to learn a separate dictionary of action primitives for each individual scale and to analyze the features of each scale independently. A distinct representation of an action is then obtained from the salient features extracted at a given spatio-temporal scale, encoded by the corresponding dictionary. We will thus have multiple representations of the same action at different scales, in which the intrinsic scale of the features is encoded by construction. There are two viable approaches to fuse these

Preprint submitted to Elsevier, December 24, 2012
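The scale-specific scheme described above can be sketched as follows: one dictionary per scale, one histogram per scale, and (as one of the fusion options) concatenation of the per-scale histograms. Again this is a hedged illustration, not the authors' implementation: the number of scales, descriptor dimension, and dictionary sizes are placeholder values, and random arrays stand in for real per-scale features.

```python
import numpy as np

rng = np.random.default_rng(1)

def learn_dictionary(X, k, iters=15):
    # Plain k-means dictionary learning over the descriptors of ONE scale.
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(0)
    return C

def encode(X, C):
    # Normalized histogram of X over the words of dictionary C.
    labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    hist = np.bincount(labels, minlength=len(C)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Hypothetical training descriptors grouped by the spatio-temporal
# scale at which their salient features were detected.
scales = {s: rng.normal(size=(300, 16)) for s in (1, 2, 3)}
dicts = {s: learn_dictionary(X, k=30) for s, X in scales.items()}

# One video now yields one histogram PER scale, so the intrinsic scale
# of each feature is preserved by construction.
video = {s: rng.normal(size=(40, 16)) for s in scales}
per_scale = {s: encode(video[s], dicts[s]) for s in scales}

# One fusion option: concatenate the scale-specific representations
# into a single vector for one classifier. The decoupled alternative
# would instead train a classifier per scale and combine their scores.
concatenated = np.concatenate([per_scale[s] for s in sorted(per_scale)])
```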