Detecting Group Activities using Rigidity of Formation*

Saad M. Khan
School of Computer Science, University of Central Florida, Orlando, Florida 32816
smkhan@cs.ucf.edu

Mubarak Shah
School of Computer Science, University of Central Florida, Orlando, Florida 32816
shah@cs.ucf.edu

ABSTRACT
Most work in human activity recognition is limited to relatively simple behaviors such as sitting down, standing up, or other dramatic posture changes. Very little has been achieved in detecting more complicated behaviors, especially those characterized by the collective participation of several individuals. In this work we present a novel approach to recognizing the class of activities characterized by rigidity of formation, for example people parades, airplane flight formations, or herds of animals. The central idea is to model the entire group as a collective rather than focusing on each individual separately. We model the formation as a 3D polygon, with each corner representing a participating entity. Tracks of the entities are treated as tracks of feature points on the 3D polygon. Based on the rank of the track matrix, we can determine whether the 3D polygon under consideration behaves rigidly or undergoes non-rigid deformation. Our method is invariant to camera motion and requires neither an a priori model nor a training phase.

Categories and Subject Descriptors
[Image Processing and Computer Vision]: Activity Recognition, Scene Analysis

Keywords
Rigid Formations, Structure from Motion, Rank Constraint

1. INTRODUCTION
Modeling and recognition of human activities from video data poses many challenges. However, a successful solution has numerous applications in video surveillance, video retrieval and summarization, video-to-text synthesis, video communications, biometrics, etc. The task is further complicated when the activity is defined by the collective behavior of a group of entities. In such a scenario, monitoring the activity of each participant separately might be unnecessary or even misleading for correct activity detection. It is the overall pattern that emerges from local interactions that characterizes a group activity.

We propose a novel approach to recognizing group activities, like people parades, that are characterized by rigidity of formation. By a formation we mean the 3D polygon emerging from the relative locations of a group of people or objects. A formation can be either rigid (i.e., maintaining its structure) or deformable, depending on the particular activity. Figure 1 demonstrates our idea of a formation of people. We model each walking person as a corner of a 3D polygon. 2D tracks of the walking people are treated as feature points on the formation. (For the purposes of this paper, we assume that hand-picked or accurately computed tracking data is available.) We demonstrate how a rank analysis of the tracking data leads to the classification of activities exhibiting rigidity of formation.

Figure 1: An example of a parade scene. The rigidity of the formation (the red polygon) characterizes the parade activity.

* This material is based upon work funded in part by the U.S. Government. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Government.

MM'05, November 6–12, 2005, Singapore.
Copyright 2005 ACM 1-59593-044-2/05/0011.
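As a concrete illustration of the kind of rank test described above (a sketch, not the authors' implementation; the function and variable names here are our own), the classic Tomasi–Kanade rank constraint states that under an affine camera, the 2F x P measurement matrix built from the tracks of P points over F frames has rank at most 3 once the per-frame centroid is subtracted, provided the underlying 3D point set is rigid. A deforming formation raises the effective rank, which gives a simple rigid/non-rigid classifier:

```python
import numpy as np

def is_rigid_formation(tracks, tol=0.05):
    """Classify a formation as rigid via a rank test on the track matrix.

    tracks : array of shape (F, P, 2) -- the (x, y) image positions of
             P entities over F frames.

    After subtracting each frame's centroid (removing translation), the
    stacked 2F x P measurement matrix of a rigid point set under affine
    projection has rank <= 3; non-rigid deformation spreads energy into
    the remaining singular values.
    """
    centered = tracks - tracks.mean(axis=1, keepdims=True)
    # Stack all x-rows over all y-rows: a 2F x P measurement matrix.
    W = np.vstack([centered[:, :, 0], centered[:, :, 1]])
    s = np.linalg.svd(W, compute_uv=False)
    # Fraction of singular-value mass beyond the first three.
    residual = s[3:].sum() / s.sum()
    return residual < tol
```

Because each frame may use a different camera pose, the test is unaffected by camera motion: rotating the camera between frames changes the rows of W but not its rank.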
One of the strengths of our method is its inherent invariance to changing viewpoint and camera motion. Changes in viewpoint affect the apparent motion and therefore complicate the analysis. Typically this problem is addressed by incorporating a view-invariant match function for comparing images [6]. This is not the case with our method of activity classification, which is grounded in the structure-from-motion framework [3, 4]. Invariance to camera view is implicitly achieved by factoring out the camera and object motion as the relative pose.
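The factoring-out of camera motion can be sketched with the standard affine factorization decomposition (again an illustrative sketch under affine-camera assumptions, not the paper's exact algorithm): truncating the SVD of the centered measurement matrix splits it into a motion factor, whose rows absorb the per-frame camera/object pose, and a shape factor describing the 3D structure, each recovered up to an affine ambiguity. The rank, which is what the rigidity test depends on, is untouched by whatever the camera does from frame to frame.

```python
import numpy as np

def factorize_tracks(W, rank=3):
    """Tomasi-Kanade-style affine factorization of a centered 2F x P
    track matrix W into motion (2F x rank) and shape (rank x P)
    factors, recovered up to an affine ambiguity.

    Rows of M encode the per-frame camera/object pose; columns of S
    encode the 3D structure of the formation.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :rank] * np.sqrt(s[:rank])          # motion factor
    S = np.sqrt(s[:rank])[:, None] * Vt[:rank]   # shape factor
    return M, S
```

For a rigid formation the product M @ S reproduces W, since W is exactly rank 3; for a deforming formation the rank-3 product is only an approximation, and the size of the residual is precisely what the classification exploits.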