Understanding Crowd Collectivity: A Meta-Tracking Approach

Afshin Dehghan, Mahdi M. Kalayeh
Center for Research in Computer Vision, University of Central Florida
adehghan@cs.ucf.edu, mahdi@eecs.ucf.edu

Abstract

Understanding pedestrian dynamics in crowded scenes is an important problem. Given highly fragmented trajectories as input, we present a novel, fully unsupervised approach to automatically infer the semantic regions in a scene. Once the semantic regions are learned, given a tracklet of a person, our model predicts the pedestrian’s starting point and destination. The method comprises three steps. First, the spatial domain of the scene is quantized into hexagons and a 2D orientation distribution function (ODF) is learned for each hexagon. Second, a Time-Homogeneous Markov Chain Meta-Tracking method is used to automatically find the sources and sinks, and then the dominant paths, in the scene. In the last step, using a 3-term trajectory clustering method, we predict the source and sink for each pedestrian. Furthermore, we introduce a 2-step trajectory reconstruction method to infer the future behavior of each individual in the scene. Qualitative and quantitative experiments on a video surveillance dataset from New York Grand Central Station demonstrate the effectiveness of our method both in finding the semantic regions and in grouping fragmented tracklets.

1. Introduction

Due to the availability of surveillance videos and increasing computational power, crowd behavior analysis has recently received significant attention. This kind of interdisciplinary study often employs or directly involves research results from the social sciences in addition to machine learning and statistical methods. However, there is no single, agreed definition of ‘a crowd’.
Our work follows the definition in [2]: a crowd can be defined as a gathering of people, standing in close proximity at a specific location to observe a specific event, who feel united by a common social identity and, despite being strangers, are able to act in a socially coherent way. This definition gives us a better understanding of crowd behavior and leads us toward the goal of this paper, which is to identify, model, and learn a representation of the collective behaviors of people in a crowd.

Figure 1. Our proposed method in a nutshell. (Legend: Prediction, Matched SR, Potential SR, Tracklet.)

One way to represent collective crowd behaviors is by finding semantic regions [22] in the scene. Semantic regions correspond to the paths that are commonly taken by objects that share the same activities and behavior. Recent approaches for understanding collective crowd behavior can be divided into two categories based on their respective feature spaces. The first category comprises the so-called motion pattern estimation methods [17, 7, 12, 10, 20]. These techniques usually employ instantaneous motion vectors (e.g., optical flow) to learn patterns of collective behavior in a scene. They have the advantage of bypassing object tracking where it is infeasible due to the density of the crowd. The second category attempts to directly organize or cluster long-term object trajectories to extract meaningful regions corresponding to dominant paths in the scene [8, 16, 1, 5]. While longer trajectories or sequences of motion flow vectors are obviously more discriminative than instantaneous motion, object tracking in crowded scenes is a difficult problem and the results of existing methods are unreliable. Severely fragmented or, worse, mislabeled trajectories contribute significantly to noise and errors in the behavior models. In that sense, the shortest motion flow will always be more reliable, albeit less discriminative. Therefore, in terms of temporal scale, these two broad categories of techniques correspond to temporally instantaneous and temporally global features, respectively. In this paper we present a new method which works with sparse as well as