A Tree-based Approach to Integrated Action Localization, Recognition and Segmentation

Zhuolin Jiang 1, Zhe Lin 2, and Larry S. Davis 1

1 University of Maryland, College Park, MD, 20742
2 Adobe Systems Incorporated, San Jose, CA, 95110
{zhuolin,lsd}@umiacs.umd.edu, zlin@adobe.com

Abstract. A tree-based approach to integrated action segmentation, localization and recognition is proposed. An action is represented as a sequence of joint HOG-flow descriptors extracted independently from each frame. During training, a set of action prototypes is first learned by k-means clustering, and a binary tree model is then constructed over the set of action prototypes by hierarchical k-means clustering. Each tree node is characterized by a shape-motion descriptor and a rejection threshold, and an action segmentation mask is defined for each leaf node (corresponding to a prototype). During testing, an action is localized by mapping each test frame to its nearest-neighbor prototype using a fast matching method that searches the learned tree, followed by global filtering refinement. An action is recognized by maximizing the sum of the joint probabilities of the action category and action prototype over the test frames. Our approach does not explicitly rely on human tracking or background subtraction, and enables action localization and recognition under realistic and challenging conditions (such as crowded backgrounds). Experimental results show that our approach achieves recognition rates of 100% on the CMU action dataset and 100% on the Weizmann dataset.

1 Introduction

Action recognition has become an active research topic in computer vision. In this paper, we propose an approach that simultaneously localizes and recognizes multiple action classes within a unified tree-based framework.
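To make the training and matching steps above concrete, the following is a minimal sketch in Python/NumPy: hierarchical 2-means recursively splits the learned prototypes into a binary tree, and a test descriptor is matched by greedy descent toward the closer child at each level. This is an illustrative sketch, not the authors' implementation; the per-node rejection thresholds, segmentation masks, and global filtering are omitted, and all function and class names here are my own.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means: returns (centers, labels). Illustrative, not optimized."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

class Node:
    def __init__(self, center, prototype=None):
        self.center = center        # descriptor characterizing this node
        self.prototype = prototype  # prototype id, set only for leaves
        self.left = self.right = None

def build_tree(prototypes, ids):
    """Build a binary tree over prototypes via hierarchical 2-means."""
    if len(prototypes) == 1:
        return Node(prototypes[0], prototype=int(ids[0]))
    _, labels = kmeans(prototypes, 2)
    if labels.min() == labels.max():
        # degenerate split: force a balanced partition to guarantee progress
        labels = np.zeros(len(prototypes), dtype=int)
        labels[len(prototypes) // 2:] = 1
    node = Node(prototypes.mean(axis=0))
    node.left = build_tree(prototypes[labels == 0], ids[labels == 0])
    node.right = build_tree(prototypes[labels == 1], ids[labels == 1])
    return node

def match(node, x):
    """Greedy tree descent: follow the closer child, return the leaf's id."""
    while node.prototype is None:
        d_left = np.linalg.norm(x - node.left.center)
        d_right = np.linalg.norm(x - node.right.center)
        node = node.left if d_left <= d_right else node.right
    return node.prototype
```

Each frame's descriptor thus reaches a leaf in roughly log-depth many comparisons rather than being compared against every prototype, which is the efficiency motivation for the tree.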
Realistic actions often occur against cluttered, dynamic backgrounds and are subject to large variations in posture and clothing, illumination changes, camera motion and occlusion. Figure 1 shows examples of action frames in realistic environments (with cluttered backgrounds and moving objects). In such cases, it is difficult to detect and segment the actors from the background, which poses a significant challenge for action recognition approaches that rely on simple preprocessing such as background subtraction [1–4]. Although much previous work exists on action recognition [5–9], robustly localizing and recognizing actions against cluttered, dynamic backgrounds remains an important open problem.