Recognizing Human Actions by Learning and Matching Shape-Motion Prototype Trees

Zhuolin Jiang, Member, IEEE, Zhe Lin, Member, IEEE, and Larry S. Davis, Fellow, IEEE

Abstract—A shape-motion prototype-based approach is introduced for action recognition. The approach represents an action as a sequence of prototypes for efficient and flexible action matching in long video sequences. During training, an action prototype tree is learned in a joint shape and motion space via hierarchical K-means clustering, and each training sequence is represented as a labeled prototype sequence; a look-up table of prototype-to-prototype distances is then generated. During testing, based on a joint probability model of the actor location and action prototype, the actor is tracked while a frame-to-prototype correspondence is established by maximizing the joint probability, which is performed efficiently by searching the learned prototype tree; actions are then recognized using dynamic prototype sequence matching. Distance measures used for sequence matching are obtained rapidly by look-up table indexing, which is an order of magnitude faster than brute-force computation of frame-to-frame distances. Our approach enables robust action matching in challenging situations (such as moving cameras and dynamic backgrounds) and allows automatic alignment of action sequences. Experimental results demonstrate that our approach achieves recognition rates of 92.86 percent on a large gesture data set (with dynamic backgrounds), 100 percent on the Weizmann action data set, 95.77 percent on the KTH action data set, 88 percent on the UCF sports data set, and 87.27 percent on the CMU action data set.

Index Terms—Action recognition, shape-motion prototype tree, hierarchical K-means clustering, joint probability, dynamic time warping.

1 INTRODUCTION

Action recognition is receiving increasing attention in computer vision due to its potential applications such as video surveillance, human-computer interaction, virtual reality, and multimedia retrieval. Descriptor matching and classification-based schemes have been common for action recognition. However, for large-scale action retrieval and recognition, where the training database consists of thousands of action videos, such a matching scheme may require a prohibitive amount of computation. Recognizing actions viewed against dynamic, varying backgrounds is another important challenge.

Many studies have addressed effective feature extraction and categorization methods for robust action recognition; detailed surveys are given in [1], [2], [3]. Feature extraction methods for activity recognition can be roughly classified into four categories: geometry-based [4], [5], [6], motion-based [7], [8], [9], [10], appearance-based [4], [11], [12], and space-time feature-based [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28]. The geometry-based approaches recover information about human body configuration, but they often rely heavily on object segmentation and tracking, which are typically difficult and time consuming. The motion-based approaches extract optical flow features for recognition, but they depend on foreground segmentation to reduce the effects of background flows. The appearance-based approaches use shape and contour information to identify actions, but they are vulnerable to cluttered, complex backgrounds.
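To make the training stage summarized in the abstract concrete, the following is a minimal sketch of learning a prototype tree by hierarchical K-means over joint shape-motion descriptors. Descriptor extraction is omitted, and the branching factor, depth limit, and minimum node size are illustrative assumptions rather than values from this paper.

```python
# Minimal sketch (not the authors' implementation) of learning a prototype
# tree by hierarchical K-means in a joint shape-motion descriptor space.
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep old center if cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def build_prototype_tree(X, branch=4, min_size=20, depth=0, max_depth=4):
    """Recursively split descriptors; every node stores the mean of the
    descriptors routed to it, and the leaves serve as action prototypes."""
    node = {"prototype": X.mean(axis=0), "children": []}
    if len(X) <= min_size or depth == max_depth:
        return node                          # leaf: one shape-motion prototype
    _, labels = kmeans(X, branch)
    for j in range(branch):
        part = X[labels == j]
        if len(part):
            node["children"].append(
                build_prototype_tree(part, branch, min_size, depth + 1, max_depth))
    return node
```

With the tree built, each training frame can be labeled by the leaf prototype it reaches, turning every training video into the labeled prototype sequence used later for matching.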
The space-time feature-based approaches either characterize actions using global space-time 3D volumes or, more compactly, using sparse space-time interest points.

Recently, methods have been introduced, e.g., [14], [29], [30], [31], [32], [33], [34], [35], that combine multiple features to detect and recognize actions. Laptev and Perez [14] used shape and motion cues to detect drinking and smoking actions. Jhuang et al. [29] introduced a biologically inspired action recognition system which uses a hierarchy of spatial-temporal feature detectors. Liu et al. [30] combined quantized vocabularies of local spatial-temporal volumes and spin images. Shet et al. [31] combined shape and motion exemplars in a unified probabilistic framework to recognize gestures. Schindler and Van Gool [32] extracted both form and motion features from an action snippet to model and recognize actions. Niebles and Fei-Fei [33] introduced a hierarchical model with a hybrid use of static shape features and spatial-temporal features for action classification. Ahmad and Lee [34] combined shape and motion flows to classify actions from multiview image sequences. Mikolajczyk and Uemura [35] extracted a large set of low-dimensional local features to learn many vocabulary trees, allowing efficient, simultaneous action localization and recognition.

Several approaches have been proposed in recent years for recognizing human actions under view changes. Junejo et al. [36] proposed a self-similarity-based descriptor for view-independent human action recognition. Parameswaran and Chellappa [37] modeled actions in terms of view-invariant
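At test time, the approach establishes frame-to-prototype correspondence by maximizing a joint probability over actor location and action prototype. The sketch below is a hypothetical simplification that drops the actor-location term; it illustrates only the tree-search component, a greedy descent that, at each node of the tree built above, follows the child whose prototype is closest to the frame descriptor.

```python
import numpy as np

def match_frame_to_prototype(tree, x):
    """Greedy descent: return the leaf prototype reached by repeatedly
    moving to the child closest to the frame descriptor x."""
    node = tree
    while node["children"]:
        node = min(node["children"],
                   key=lambda c: float(np.linalg.norm(c["prototype"] - x)))
    return node["prototype"]
```

For a tree with branching factor b over V prototypes, the descent examines roughly b log_b V candidates instead of all V, which is what makes per-frame matching cheap enough for long video sequences.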
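Recognition then reduces to dynamic prototype sequence matching. Assuming every frame has already been assigned a prototype index and a prototype-to-prototype distance table D has been precomputed, a standard dynamic time warping recursion with O(1) table look-ups might look as follows; this is a generic DTW sketch under those assumptions, not the paper's exact matching scheme.

```python
import numpy as np

def dtw_distance(seq_a, seq_b, D):
    """DTW over integer prototype label sequences; D[i, j] is the
    precomputed distance between prototypes i and j."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = D[seq_a[i - 1], seq_b[j - 1]]    # O(1) look-up, no descriptor math
            cost[i, j] = d + min(cost[i - 1, j],     # skip a frame in seq_a
                                 cost[i, j - 1],     # skip a frame in seq_b
                                 cost[i - 1, j - 1]) # align the two frames
    return cost[n, m]
```

Because every frame-pair distance is an index into D rather than a fresh descriptor comparison, the cost of matching two sequences is dominated by the recursion itself; this is the source of the order-of-magnitude speedup over brute-force frame-to-frame distance computation claimed in the abstract.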