A MARKERLESS APPROACH FOR CONSISTENT ACTION RECOGNITION IN A MULTI-CAMERA SYSTEM

Simone Calderara 1, Andrea Prati 2, Rita Cucchiara 1
1 Dipartimento di Ingegneria dell'Informazione, University of Modena and Reggio Emilia, Italy
2 Dipartimento di Scienze e Metodi dell'Ingegneria, University of Modena and Reggio Emilia, Italy

ABSTRACT

This paper presents a method for recognizing human actions in a multi-camera setup. The proposed method automatically extracts significant points on the human body, without the need for artificial markers. A sophisticated appearance-based tracking algorithm, able to cope with occlusions, is exploited to extract a probability map for each moving object. A segmentation technique based on a mixture of Gaussians is then employed to extract and track significant points on this map, corresponding to significant regions on the human silhouette. The point tracking produces a set of 3D trajectories that are compared with other trajectories by means of global alignment and dynamic programming techniques. Preliminary experiments showed the potential of the proposed approach.

Index Terms— Action recognition, mean tracking, mixture of Gaussians, dynamic programming.

1. INTRODUCTION AND RELATED WORKS

Labeling actions taking place in a given scene is a task of paramount importance for behavior analysis. The main challenge lies in developing a method able to cope with almost every type of action, even when actions are very similar to one another, and also in the case of cluttered and complex scenarios. In the recent past, many researchers have addressed action recognition in video sequences in different contexts and with different purposes, ranging from sports video analysis to video surveillance to human-centred computing. For several years, researchers have concentrated on ad-hoc solutions to identify, often with heuristic rules, specific actions such as fighting, talking, etc. [1].
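The trajectory comparison mentioned in the abstract, based on global alignment and dynamic programming, can be illustrated with a minimal sketch in the Needleman-Wunsch style. The Euclidean point cost and the fixed gap penalty below are illustrative assumptions, not the paper's exact formulation:

```python
# Sketch of global alignment of two 3D point trajectories via dynamic
# programming (Needleman-Wunsch style). Cost function and gap penalty
# are assumptions for illustration only.
import math

def point_distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def align_trajectories(t1, t2, gap_penalty=1.0):
    """Return the global alignment cost between two trajectories.

    A lower cost indicates more similar trajectories; gaps allow the
    alignment to tolerate different lengths and missing detections.
    """
    n, m = len(t1), len(t2)
    # D[i][j]: cost of aligning the first i points of t1 with the first j of t2
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        D[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + point_distance(t1[i - 1], t2[j - 1]),  # match
                D[i - 1][j] + gap_penalty,  # skip a point of t1
                D[i][j - 1] + gap_penalty,  # skip a point of t2
            )
    return D[n][m]
```

Identical trajectories align at zero cost, while extra or missing points are absorbed by gap penalties rather than forcing a one-to-one matching.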
However, recent advances in computer vision and statistical pattern recognition offer an effective and often efficient help for the recognition of higher-level actions, such as abandoned luggage detection, repetitive and abnormal path detection, or people-to-people interactions.

Basic approaches for recognizing human actions are based on either the analysis of body shape (in 2D or 3D) or the analysis of the dynamics of prominent points or parts of the human body. More specifically, action recognition approaches can be divided into two main groups [2] depending on whether the analysis is performed directly in the image plane (2D approaches) or using a three-dimensional reconstruction of the action itself (3D approaches). The latter have been widely adopted where building and fitting a 3D model of the body parts performing the action is relatively simple due to controlled environmental conditions and a high-resolution view of the object. For instance, Rehg and Kanade in [3] used a 27 degree-of-freedom (DOF) hand model to recognize poses and gestures, while Goncalves et al. in [4] addressed the problem of analyzing human arm positions against a simple uncluttered background.

These methods are often unfeasible in real-time surveillance applications. Gavrila and Davis in [5] adopted a 22-DOF human-body model to detect actions against a complex background, but their approach constrains the user to wear a tight-fitting body suit with contrasting limb colors to simplify the edge detection problem in case of self-occlusions. Regardless of the sophistication of the approach, these methods can be applied only if a more or less detailed model of the target exists.

This work is partially supported by the project BESAFE (Behavior lEarning in Surveilled Areas with Feature Extraction) funded by the NATO Science for Peace programme and by the project FREE SURF funded by the Italian MIUR Ministry.
On the contrary, 2D approaches analyze the action in the image plane, relaxing all the environmental constraints of 3D approaches but lowering the discriminative power of the action-classification task. People action classification can be performed in the image plane by either explicitly observing and tracking feature points (local feature approaches [6]) or considering the whole shape-motion as a feature itself (holistic approaches [7, 8]).

Yilmaz and Shah in [9] exploited people contour-point tracking to build a 3D volume describing the action; their work represents an example of local feature approaches. A compact representation of this action-specific volume was presented and proved effective in distinguishing among several predefined actions. Although this proposal is effective in most situations, contour-point tracking is difficult to achieve in real-time systems, leading to an NP-hard optimization problem when points occlude each other and one-to-one matching is impossible.

Niebles et al. in [10] proposed a feature-based approach

978-1-4244-2665-2/08/$25.00 © 2008 IEEE