Understanding human interactions with track and body synergies (TBS) captured from multiple views

Sangho Park *, Mohan M. Trivedi

Department of Computer Science, Computer Vision and Robotics Research Laboratory, University of California, San Diego, San Diego, CA 92122, USA

Received 17 October 2006; accepted 3 October 2007; available online 16 February 2008

Computer Vision and Image Understanding 111 (2008) 2–20. doi:10.1016/j.cviu.2007.10.005

* Corresponding author. E-mail addresses: parks@ucsd.edu (S. Park), mtrivedi@ucsd.edu (M.M. Trivedi).

Abstract

This paper presents a new two-stage multi-view framework for the analysis of human interactions and activities. The analysis is performed in a distributed multi-view vision system that synergistically integrates track- and body-level processing. The proposed framework is geared toward versatile and easily deployable systems that do not require careful camera calibration. The main contributions of the paper are as follows: (1) context-dependent view switching for occlusion handling; (2) a method for switching the two-stage analysis between track- and body-level processing; and (3) a hypothesis–verification paradigm for top-down feedback that exploits the spatio-temporal constraints inherent in human interaction. An experimental evaluation shows the efficacy of the proposed system for analyzing multi-person interactions.

© 2008 Elsevier Inc. All rights reserved.

Keywords: Person tracking; Gesture analysis; Distributed video arrays; Multi-level representation; View selection; Situational awareness

1. Introduction and motivation

The analysis of human interactions involving objects is an important research problem in computer vision, with a wide range of potential applications including video surveillance, security enforcement, event annotation, and motion analysis in sports. Multi-person interactions raise particularly difficult issues for computer vision, namely occlusion between objects and body deformation during interaction.

Fig. 1 illustrates multi-person interaction situations that benefit from two-stage multi-view analysis. In a two-person interaction (Fig. 1a), a single-camera system with viewing direction V1 may be sufficient for monitoring interaction A between persons P1 and P2, provided the imaging condition is appropriate (i.e., the camera's viewing direction is orthogonal to the interaction). As the imaging configuration becomes sub-optimal as a result of the movement of P1 and P2 (Fig. 1b), single-camera monitoring becomes more difficult and unreliable due to occlusion and appearance change. In multi-person interactions that involve more than two persons (Fig. 1c), a multi-view system may be unavoidable even under the best viewing conditions (i.e., an orthogonal view from each camera). In this situation, the viewing directions V1 and V2 are optimal for monitoring interaction A, between persons P1 and P2, and interaction B, between persons P2 and P3, respectively. As the situation becomes more complicated with multi-person movements (Fig. 1d), dynamic selection and coordination of multiple views becomes important and requires data fusion. The primary difficulty in fusing data from multiple cameras is deciding when, and which, camera inputs to fuse.

Articulated human motion entails body deformation during activity. An integrated understanding of human activity therefore requires multiple levels of analysis. The scope of this paper considers two stages of detail: track-level and body-level analysis. At the track level, human activity is analyzed in terms of the tracks of moving bounding boxes around each person. At the body level, human activity is analyzed in more detail with the coordinated posture and
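The view-selection criterion described above (an orthogonal viewing direction is best for monitoring a person pair; a parallel one is worst) can be illustrated with a minimal geometric sketch on the 2D ground plane. This is not the paper's implementation: the function names, camera direction vectors, and the |sin|-based quality score are illustrative assumptions.

```python
import math

def view_quality(cam_dir, p1, p2):
    """Score a camera for monitoring the interaction between p1 and p2.

    Returns 1.0 when the viewing direction is orthogonal to the
    person-pair baseline (best case, as in Fig. 1a/1c) and 0.0 when
    it is parallel to it (worst case). All inputs are 2D tuples on
    an assumed ground plane.
    """
    bx, by = p2[0] - p1[0], p2[1] - p1[1]      # person-pair baseline
    blen = math.hypot(bx, by)
    clen = math.hypot(cam_dir[0], cam_dir[1])
    # |cos| of the angle between baseline and viewing direction
    cos_a = abs(bx * cam_dir[0] + by * cam_dir[1]) / (blen * clen)
    # quality = |sin| of that angle: 1 when orthogonal, 0 when parallel
    return math.sqrt(max(0.0, 1.0 - cos_a * cos_a))

def select_view(cameras, p1, p2):
    """Pick the camera (by name) with the most orthogonal view of (p1, p2)."""
    return max(cameras, key=lambda name: view_quality(cameras[name], p1, p2))

# Two persons side by side along the x-axis; V1 looks along -y (orthogonal
# to the baseline), V2 looks along -x (parallel to it), so V1 is selected.
cameras = {"V1": (0.0, -1.0), "V2": (-1.0, 0.0)}
best = select_view(cameras, (0.0, 0.0), (1.0, 0.0))  # -> "V1"
```

In a dynamic scene (Fig. 1d), such a score would be re-evaluated per frame and per person pair, providing a simple basis for the context-dependent view switching that the paper develops.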