Weakly Supervised Action Recognition using Implicit Shape Models

Tuan Hue Thi 1, Li Cheng 2, Jian Zhang 1, Li Wang 3, Shinichi Satoh 4
1 National ICT of Australia & University of New South Wales, NSW, Australia
2 Toyota Technological Institute at Chicago, Illinois, USA
3 Southeast University, Nanjing, China
4 National Institute of Informatics, Tokyo, Japan

Abstract

In this paper, we present a robust framework for action recognition in video that performs competitively against state-of-the-art methods, yet does not rely on sophisticated background subtraction preprocessing to remove background features. In particular, we extend the Implicit Shape Model (ISM) of [10] for object recognition to 3D in order to integrate local spatio-temporal features, which are produced by a weakly supervised Bayesian kernel filter. Experiments on benchmark datasets (including KTH [11] and Weizmann [5]) verify the effectiveness of our approach.

1. Introduction

Visual action recognition is a crucial problem in video analysis and understanding. It is nevertheless a challenging task due to non-rigid object and motion shapes and variations arising from changes in viewing angle and distance, and is further complicated by camera motion as well as background clutter. These difficulties prohibit practical attempts toward building a rigorous global model for each action class, as such models often have limited capacity to capture non-rigid shapes with varying poses and hence provide very little generalization to unseen data. Recent work such as [9, 11] partially addresses these issues by utilizing local features that are invariant to pose changes. On the other hand, to obtain a satisfactory recognition rate, a de facto procedure is to apply dedicated preprocessing to each video sequence using sophisticated background subtraction techniques, in order to extract accurate foreground objects [4, 5, 6, 7].
This procedure often involves heavy manual interaction and does not generalize well to novel videos.

In this paper, we propose a robust approach that addresses both limitations. Starting with local features invariant to view and scale changes, our approach applies an improved variant of the weakly supervised Bayesian learning work of Carbonetto et al. [2, 8] in object detection to videos, in order to focus on foreground actions with very little supervision. Moreover, we extend the Implicit Shape Model of Leibe et al. [10] to 3D. This enables us to robustly integrate the set of local features into a global configuration, while still capturing local saliency. Empirical experiments convincingly demonstrate the competitiveness of our proposed approach compared with the best known results.

2. Local Features as Video Representation

A video shot, in our perspective, is a complex set of local features under various configurations. Tackling action recognition this way, as discussed earlier, helps lighten the dependency on view and scale variations in the visual appearance of actions. We adopt the existing Space Time Interest Point (STIP) detection technique from Laptev et al. [9] to detect points with high motion change. In addition, observing that certain regions around these detected points also contribute to the action context, we refine the STIP detection results with a post-inpainting procedure, whose idea is similar to the image inpainting described by Criminisi et al. [3]. The inpainting process starts on the boundary of connected STIP point regions and, based on the median scale and frequency of these STIP points, generates hypotheses about whether other points in the neighborhood should be included. Figure 2 illustrates the effect of our improved technique, inpainted Space Time Interest Points (iSTIP), over the traditional STIP.
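The neighborhood-growing step of the refinement can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `refine_stip`, the candidate-point representation, and the choice of radius as a multiple of the median detection scale are all assumptions made for illustration.

```python
import numpy as np

def refine_stip(points, scales, candidates, radius_factor=1.5):
    """Hypothetical sketch of iSTIP-style refinement: admit candidate
    points that fall within a neighborhood of an existing STIP detection,
    with the neighborhood radius set from the median detection scale.

    points     : (N, 3) array of detected STIP locations (x, y, t)
    scales     : (N,) array of detection scales
    candidates : (M, 3) array of nearby points to test for inclusion
    """
    med_scale = np.median(scales)       # median scale of detected points
    radius = radius_factor * med_scale  # assumed neighborhood radius
    kept = []
    for c in candidates:
        # keep a candidate if it lies close to any detected STIP point
        d = np.linalg.norm(points - c, axis=1)
        if np.any(d <= radius):
            kept.append(c)
    if not kept:
        return points
    return np.vstack([points] + [k[None, :] for k in kept])
```

In practice the paper's hypothesis generation also uses the frequency of STIP points on the region boundary; the distance test above stands in for that fuller criterion.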
The surrounding areas of the detected pixels are then described using a concatenation of Histogram of Oriented Gradients (HOG) and Histogram of Oriented Flow (HOF) [9]. In order to better organize the interest points in terms of their appearance, we use the ag-

2010 International Conference on Pattern Recognition 1051-4651/10 $26.00 © 2010 IEEE DOI 10.1109/ICPR.2010.858 3505
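The concatenated HOG/HOF descriptor described above can be sketched as follows. The bin count, the soft-binning scheme, and the per-channel L2 normalization are illustrative assumptions; they are not the exact quantization used in [9].

```python
import numpy as np

def hog_hof_descriptor(grad_mag, grad_ori, flow_mag, flow_ori, n_bins=8):
    """Sketch of a concatenated HOG/HOF descriptor for one local patch.

    grad_mag, grad_ori : gradient magnitude and orientation per pixel
    flow_mag, flow_ori : optical-flow magnitude and orientation per pixel
    Orientations are in radians; bin layout is an assumption.
    """
    def oriented_hist(mag, ori):
        # accumulate each pixel's magnitude into an orientation bin
        bins = (ori % (2 * np.pi)) / (2 * np.pi) * n_bins
        hist = np.zeros(n_bins)
        for m, b in zip(mag.ravel(), bins.ravel()):
            hist[int(b) % n_bins] += m
        hist /= (np.linalg.norm(hist) + 1e-8)  # L2 normalize the channel
        return hist

    hog = oriented_hist(grad_mag, grad_ori)  # appearance channel
    hof = oriented_hist(flow_mag, flow_ori)  # motion channel
    return np.concatenate([hog, hof])        # final 2 * n_bins descriptor
```

Concatenating the two normalized channels keeps appearance and motion cues on comparable scales before any clustering over the interest points.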