Understanding Crowd Collectivity: A Meta-Tracking Approach

Afshin Dehghan, Mahdi M. Kalayeh
Center for Research in Computer Vision, University of Central Florida
adehghan@cs.ucf.edu, mahdi@eecs.ucf.edu

Abstract

Understanding pedestrian dynamics in crowded scenes is an important problem. Given highly fragmented trajectories as input, we present a novel, fully unsupervised approach to automatically infer the semantic regions in a scene. Once the semantic regions are learned, given a tracklet of a person, our model predicts the pedestrian’s starting point and destination. The method comprises three steps. First, the spatial domain of the scene is quantized into hexagons and a 2D orientation distribution function (ODF) is learned for each hexagon. Second, a Time-Homogeneous Markov Chain Meta-Tracking method is used to automatically find the sources and sinks and then the dominant paths in the scene. In the last step, using a three-term trajectory clustering method, we predict the source and sink for each pedestrian. Furthermore, we introduce a two-step trajectory reconstruction method to infer the future behavior of each individual in the scene. Qualitative and quantitative experiments on a video surveillance dataset from New York Grand Central Station demonstrate the effectiveness of our method both in finding the semantic regions and in grouping fragmented tracklets.

1. Introduction

Due to the availability of surveillance videos and increasing computational power, crowd behavior analysis has recently received significant attention. This kind of interdisciplinary study often employs or directly involves research results from the social sciences, in addition to machine learning and statistical methods. However, there is no single, agreed definition of ‘a crowd’.
Our work follows the definition in [2]: a crowd can be defined as a gathering of people, standing in close proximity at a specific location to observe a specific event, who feel united by a common social identity and, despite being strangers, are able to act in a socially coherent way. This definition gives us a better understanding of crowd behavior and leads us toward the goal of this paper, which is to identify, model, and learn a representation of the collective behaviors of people in a crowd.

Figure 1. Our proposed method in a nutshell. (Legend: Prediction, Matched SR, Potential SR, Tracklet.)

One way to represent collective crowd behaviors is through finding semantic regions [22] in the scene. Semantic regions correspond to the paths that are commonly taken by objects which share the same activities and behavior. Recent approaches for understanding collective crowd behavior can be divided into two categories based on their respective feature spaces. The first category comprises the so-called motion pattern estimation methods [17, 7, 12, 10, 20]. These techniques usually employ instantaneous motion vectors (e.g., optical flow) to learn patterns of collective behavior in a scene. They have the advantage of bypassing object tracking where tracking is infeasible because the crowd is dense. The second group of methods attempts to directly organize or cluster long-term object trajectories to extract meaningful regions corresponding to dominant paths in the scene [8, 16, 1, 5]. While longer trajectories or sequences of motion flow vectors are obviously more discriminative than instantaneous motion, object tracking in crowded scenes is a difficult problem and the results of existing methods are unreliable. Severely fragmented or, worse, mislabeled trajectories contribute significantly to noise and errors in the resulting behavior models. In that sense, the shortest motion flow will always be more reliable, albeit less discriminative.
Therefore, in terms of temporal scale, these two broad categories of techniques correspond respectively to temporally instantaneous and temporally global features. In this paper we present a new method which works with sparse as well as