Action Recognition in Video by Covariance Matching of Silhouette Tunnels

Kai Guo, Prakash Ishwar, and Janusz Konrad
Department of Electrical and Computer Engineering, Boston University
8 Saint Mary's St., Boston, MA USA 02215
{kaiguo,pi,jkonrad}@bu.edu

Abstract—Action recognition is a challenging problem in video analytics due to event complexity, variations in imaging conditions, and intra- and inter-individual action variability. Central to these challenges is the way one models actions in video, i.e., action representation. In this paper, an action is viewed as a temporal sequence of local shape deformations of centroid-centered object silhouettes, i.e., the shape of the centroid-centered object silhouette tunnel. Each action is represented by the empirical covariance matrix of a set of 13-dimensional normalized geometric feature vectors that capture the shape of the silhouette tunnel. The similarity of two actions is measured by a Riemannian metric between their covariance matrices. The silhouette tunnel of a test video is broken into short overlapping segments, and each segment is classified using a dictionary of labeled action covariance matrices and the nearest-neighbor rule. On a database of 90 short video sequences, this approach attains a correct classification rate of 97%, which is very close to the state of the art, at an almost 5-fold lower computational cost. Majority-vote fusion of segment decisions achieves a 100% classification rate.

Keywords—video analysis; action recognition; silhouette tunnel; covariance matching; generalized eigenvalues

I. INTRODUCTION

The proliferation of network cameras in the last few years has led to surveillance-video overload; cameras produce data at rates far exceeding the capacity of the human operators managing video surveillance networks. Thus, automatic or semi-automatic surveillance-video analysis methods are of great interest. Of the many facets of video analysis, action recognition stands out as particularly important.
For example, the ability to recognize that a person is running away from the scene of an accident, or that a car is being driven erratically, can help alert law enforcement in real time or be useful in post-event video forensics. Action recognition also finds application in video retrieval, video indexing, and the detection of abnormal behavior.

Despite a significant effort by the computer vision and image processing communities, action recognition remains a challenging problem on account of the event complexity often present in video (e.g., clutter, occlusions), variations in imaging conditions (e.g., illumination, viewpoint, resolution), and the only approximate repeatability of the same action by different individuals (e.g., no two individuals walk in exactly the same manner). Central to these challenges is the way one models actions in a video sequence, i.e., action representation. Some widely used action representations are: static features based on limb shapes [1], [2], geometric models of objects [3], [4], motion/optical-flow patterns induced by moving objects [5], [6], and spatio-temporal features extracted from the space-time video volume [7], [8], [9]. While some of these representations rely on pixel intensity, others are based on binary masks (often called silhouettes) or on motion fields associated with moving objects. Experience to date has shown that action representation based on pixel intensities is not robust; differently dressed people performing the same action may be considered to act differently.

(Acknowledgment: This material is based upon work supported by the US National Science Foundation (NSF) under awards CNS-0721884 and (CAREER) CCF-0546598. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.)
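As a rough illustration of the covariance-matching idea summarized in the abstract, the sketch below compares actions by a generalized-eigenvalue-based distance between empirical feature covariance matrices and classifies with the nearest-neighbor rule. The feature matrices, labels, and the specific 13-dimensional descriptor here are placeholders, not the paper's actual silhouette-tunnel features.

```python
# Minimal sketch of covariance-based action matching (NumPy only).
# The silhouette-tunnel feature extraction itself is NOT shown; each
# action is assumed to yield an (n_samples x 13) matrix of feature vectors.
import numpy as np

def action_covariance(features):
    """Empirical covariance of the feature vectors (rows = samples)."""
    return np.cov(features, rowvar=False)

def covariance_distance(c1, c2):
    """Distance from the generalized eigenvalues lambda_i of (c1, c2):
    rho(C1, C2) = sqrt(sum_i ln^2 lambda_i).
    For symmetric positive-definite inputs the lambda_i are real and
    positive; they solve C1 x = lambda C2 x, i.e., eig(C2^{-1} C1)."""
    lam = np.linalg.eigvals(np.linalg.solve(c2, c1)).real
    return np.sqrt(np.sum(np.log(lam) ** 2))

def classify(test_features, dictionary):
    """Nearest-neighbor rule over a dictionary {label: covariance}."""
    c_test = action_covariance(test_features)
    return min(dictionary, key=lambda lbl: covariance_distance(c_test, dictionary[lbl]))
```

In this sketch a test segment is labeled by the dictionary entry whose covariance matrix is closest in the above metric; the paper's segment-wise classification and majority-vote fusion would repeat this over short overlapping tunnel segments.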
While action recognition based on motion fields has been quite successful, it requires the additional, and not so simple, step of motion estimation. However, the dynamic nature of an action captured by a motion field is largely captured by the object's silhouette evolving in time, i.e., a binary mask of a moving object changing its shape in time, which we shall call a silhouette tunnel.

Silhouette tunnels, also known as object tunnels [10], [11] or activity tubes [12], have been extensively studied in the literature, with applications in video compression, summarization, frame-rate conversion, etc. Although silhouette tunnels do not capture motion inside objects, the moving silhouette boundary leaves a very distinct signature of the occurring activity. Furthermore, a silhouette tunnel is devoid of color, texture, and background characteristics, making it an appropriate representation of action in x-y-t space regardless of the photometric properties of the moving object.

To date, several action recognition methods have been based on silhouettes [7], [13], [14], [15]. In particular, Gorelick et al. [7] developed a method that extracts shape properties of a silhouette tunnel by solving a Poisson equation (measuring the average length of a random walk from an interior point to the silhouette tunnel boundary). Action classification based on this approach was shown to be remarkably accurate, suggesting that the method is capable of extracting highly discriminative information. However, the procedures used to extract spatio-temporal fea-