Weakly Supervised Action Recognition using Implicit Shape Models
Tuan Hue Thi¹, Li Cheng², Jian Zhang¹, Li Wang³, Shinichi Satoh⁴

¹ National ICT of Australia & University of New South Wales, NSW, Australia
² Toyota Technological Institute at Chicago, Illinois, USA
³ Southeast University, Nanjing, China
⁴ National Institute of Informatics, Tokyo, Japan
Abstract
In this paper, we present a robust framework for action recognition in video that performs competitively against state-of-the-art methods, yet does not rely on a sophisticated background-subtraction preprocessing step to remove background features. In particular, we extend the Implicit Shape Model (ISM) of [10] from object recognition to 3D, integrating local spatio-temporal features produced by a weakly supervised Bayesian kernel filter. Experiments on benchmark datasets (including KTH [11] and Weizmann [5]) verify the effectiveness of our approach.
1. Introduction
Visual action recognition is a crucial problem in video analysis and understanding. It is nevertheless a challenging task due to non-rigid object and motion shapes and variations caused by changes in viewing angle and distance, and it is further complicated by camera motion as well as background clutter. These difficulties prohibit practical attempts to build a rigorous global model for each action class, as such models often have limited capacity to capture non-rigid shapes with varying poses and hence generalize poorly to unseen data. Recent work such as [9, 11] partially addresses these issues by utilizing local features that are invariant to pose changes. On the other hand, to obtain a satisfactory recognition rate, a de facto procedure is to preprocess each video sequence with sophisticated background subtraction techniques in order to extract accurate foreground objects [4, 5, 6, 7]. This procedure often involves heavy manual interaction and does not generalize well to novel videos.
In this paper, we propose a robust approach capable of addressing both limitations. Starting with local features invariant to view and scale changes, our approach applies an improved variant of the weakly supervised Bayesian learning of Carbonetto et al. [2, 8], originally developed for object detection, to videos, so as to focus on foreground actions with very little supervision. Moreover, we extend the Implicit Shape Model of Leibe et al. [10] to 3D. This enables us to robustly integrate the set of local features into a global configuration while still capturing local saliency. Empirical experiments convincingly demonstrate the competitiveness of our proposed approach compared with the best known results.
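The details of the 3D extension are not given in this paper excerpt, but the core idea of ISM-style voting lifted to space-time can be sketched roughly as follows. This is an illustrative approximation under our own assumptions, not the authors' implementation; all function and variable names are hypothetical:

```python
import numpy as np

def ism3d_vote(features, codebook, offsets, grid_shape):
    """Cast 3D Hough votes: each local feature, matched to its nearest
    codebook word, votes for candidate action centers in (x, y, t) at
    the (dx, dy, dt) offsets learned for that word during training."""
    acc = np.zeros(grid_shape)  # accumulator over the (x, y, t) grid
    for pos, desc in features:  # pos = (x, y, t), desc = descriptor vector
        # match the descriptor to the nearest codebook word
        w = int(np.argmin(np.linalg.norm(codebook - desc, axis=1)))
        # each stored offset for this word votes for one center location
        for dx, dy, dt in offsets[w]:
            x, y, t = pos[0] - dx, pos[1] - dy, pos[2] - dt
            if (0 <= x < grid_shape[0] and 0 <= y < grid_shape[1]
                    and 0 <= t < grid_shape[2]):
                acc[x, y, t] += 1.0
    return acc  # maxima indicate likely action centers in space-time
```

Peaks in the accumulator would then be taken as hypothesized action occurrences, mirroring the 2D ISM detection procedure of Leibe et al. [10].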
2. Local Features as Video Representation
From our perspective, a video shot is a complex set of local features under various configurations. Tackling action recognition this way, as discussed earlier, helps lighten the dependency on view and scale variations of an action's visual appearance. We adopt the existing Space Time Interest Point (STIP) detection technique of Laptev et al. [9] to detect points with high motion change. In addition, observing that certain regions around these detected points also contribute to the action context, we refine the STIP detection results with a post-hoc inpainting procedure, an idea similar to the image inpainting described by Criminisi et al. [3]. The inpainting process starts on the boundary of connected STIP point regions and, based on the median scale and frequency of these STIP points, generates hypotheses about whether other points in the neighborhood should be included. Figure 2 illustrates the improvement of our technique, inpainted Space Time Interest Points (iSTIP), over traditional STIP.
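As a rough illustration of this hypothesis-generation step (the exact iSTIP acceptance criterion is not specified here, so the response test, threshold, and all names below are our assumptions), a minimal sketch might look like:

```python
import numpy as np

def refine_stip(points, scales, neighbors, response, threshold=0.5):
    """Hypothetical sketch of iSTIP-style refinement: grow the detected
    STIP set outward from its boundary, admitting neighboring points
    whose response is plausible given the median scale of the set."""
    med_scale = np.median(scales)  # median detection scale of the STIPs
    kept = set(points)
    # boundary of the connected STIP regions: neighbors not yet included
    frontier = [q for p in points for q in neighbors(p) if q not in kept]
    for q in frontier:
        # hypothesis test: admit the point if its response, weighted by
        # the median scale, exceeds a fixed threshold (assumed rule)
        if response(q) * med_scale >= threshold:
            kept.add(q)
    return kept
```

In practice one would iterate this growth until no frontier point passes the test, analogous to how image inpainting [3] proceeds inward from the region boundary.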
The surrounding areas of the detected pixels are then described using a concatenation of Histogram of Oriented Gradients (HOG) and Histogram of Oriented Flow (HOF) descriptors [9]. In order to better organize the interest points in terms of their appearance, we use the ag-
2010 International Conference on Pattern Recognition
1051-4651/10 $26.00 © 2010 IEEE
DOI 10.1109/ICPR.2010.858
3505