PRECISE PEOPLE COUNTING IN REAL TIME

Luca Zini, Nicoletta Noceti, Francesca Odone
DIBRIS, Università degli Studi di Genova
Via Dodecaneso 35, 16146 Genova (I)

ABSTRACT

In this paper we propose a motion-based people counting algorithm that relies on a weak camera calibration and produces a smooth estimate of the number of people in the scene. The method performs an analysis of the severity of possible occlusions and integrates instantaneous observations over time. The key features of the algorithm are a simple pipeline, a small computational cost, the use of a model-free approach that does not need complex training procedures, and its ability to work in different types of scenarios. We report results on both benchmark and in-house datasets of different degrees of complexity, showing how our solution achieves comparable or superior performance with respect to state-of-the-art methods while running in real time.

Index Terms— people counting, scene geometry, temporal filtering, real-time, video-surveillance

1. INTRODUCTION

The problem of estimating the number of people in images and videos is relevant to a broad range of applications (see e.g. [1, 2]): in a security framework, to detect unusual or potentially dangerous crowd densities, or as a pre-processing step for higher-level video surveillance algorithms; for commercial purposes, to model the number of people visiting a shop as a function of time and other variables; for public safety during large gatherings; and to plan public transport. The complexity of this task, commonly referred to as people counting, resides in the high variability of crowd appearance, due to both intrinsic and extrinsic factors such as the distance from the camera, the texture of the background, the different densities of groups of people, the viewpoint, and occlusions between people.
In this paper we propose a simple method, tailored to the video-surveillance setting, that allows us to obtain a precise real-time estimate of the number of people of variable density in a weakly calibrated setting.

In the literature, the solutions proposed for people counting may be based on appearance – with texture analysis or people detection – on motion, or on a combination of the two. Motion-based methods start from the foreground map and usually try to associate a people count with each portion of the foreground map. This idea has been justified in [1] and further developed in [3, 2, 4, 5]. In [6] the calibration is used to estimate the area occupied by the crowd on the ground, and a cylindrical model of the person is used to estimate the number of people; this method, albeit very effective, appears to be too restrictive for real scenarios and is computationally challenging.

Appearance-based methods are motivated by the purpose of avoiding assumptions on the scene (e.g. a planar scenario, static cameras) and of not relying on foreground segmentation algorithms. Among these methods we mention the direct counting of detected pedestrians or pedestrian heads [7, 8, 9], possibly coupled with a tracker [10] or background subtraction [11] to improve performance and filter out noise. Appearance-based methods pose fewer constraints but are computationally more expensive, especially in crowded scenarios. Alternatively, texture-based methods, instead of detecting single entities, aim at evaluating the density of a group, for instance with Gray Level Dependency Matrices [12], Sparse Spatio-Temporal Local Binary Patterns (SST-LBP) [13], or Gabor filters [14]. These approaches may not adapt to low densities. Finally, local features are used in [15], which applies an SVR to feature vectors derived from SURF points extracted from moving regions, and in [16], which simply relates the number of corners to the number of people. Among the hybrid methods we mention [17] and [18].
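The core idea shared by motion-based methods of this family – projecting the foreground footprint onto the ground plane and dividing by the area a single person occupies – can be sketched as follows. This is a minimal illustration, not the implementation of any cited method: the homography H, the contour representation, and the constant person footprint are all assumptions made for the example.

```python
import numpy as np

def ground_area(blob_contour_px, H):
    """Warp the pixel contour of a foreground blob onto the ground
    plane via a homography H (the weak calibration) and return the
    area of the warped polygon via the shoelace formula.
    blob_contour_px: (N, 2) array of contour points in pixels."""
    pts = np.hstack([blob_contour_px, np.ones((len(blob_contour_px), 1))])
    warped = (H @ pts.T).T
    warped = warped[:, :2] / warped[:, 2:3]   # back from homogeneous coords
    x, y = warped[:, 0], warped[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

def count_from_area(area_m2, person_footprint_m2=0.25):
    """Naive raw count: ground area divided by the average footprint
    of one person (person_footprint_m2 is an assumed constant)."""
    if area_m2 <= 0:
        return 0
    return max(1, round(area_m2 / person_footprint_m2))
```

With an identity homography, a unit square on the ground gives area 1.0 and, with a 0.25 m² footprint, a raw count of 4; in practice H comes from the scene calibration and the raw count still needs the corrections discussed below.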
Usually, the combination of different features is done with machine learning algorithms, requiring the construction of an appropriate training set, system training, and parameter tuning. In this respect, to our knowledge, there is no work in the literature providing explicit evidence that a solution learned on one scenario generalizes to a different one.

The method we propose is a motion-based procedure able to cope with different people densities and diverse configurations (in particular scenarios with a large scene depth, where the apparent size of objects may change considerably). To this purpose we rely on a weak camera calibration and, similarly to [6], we estimate the area occupied by a person or a group on the ground. Then, we correct the estimated area by means of a piecewise-linear function that models the different amount of intra-person gaps in small and large groups. Finally, we exploit the temporal continuity of the observations to filter out the effect of temporary occlusions and of ambiguous configurations. Our method has a very
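The two correction steps above – a piecewise-linear remapping of the raw area-based count and a temporal smoothing of the instantaneous estimates – can be sketched as follows. The breakpoint values and the choice of a sliding-window median are illustrative assumptions, not the parameters or the exact integration scheme used in the paper.

```python
from collections import deque
import numpy as np

# Illustrative knots (not the fitted values from the paper): in large
# groups people pack more tightly, so the per-person ground area
# shrinks and the corrected count grows faster than the raw one.
KNOTS_RAW = [0.0, 2.0, 5.0, 15.0]
KNOTS_CORRECTED = [0.0, 2.0, 6.0, 20.0]

def correct_count(raw_count):
    """Piecewise-linear correction of the raw area-based count,
    modeling the different amount of intra-person gaps in small
    and large groups."""
    return float(np.interp(raw_count, KNOTS_RAW, KNOTS_CORRECTED))

class TemporalFilter:
    """Median over a sliding window of instantaneous estimates,
    damping transient occlusions and ambiguous frames (a simple
    stand-in for the temporal integration step)."""
    def __init__(self, window=9):
        self.buf = deque(maxlen=window)

    def update(self, estimate):
        self.buf.append(estimate)
        return int(round(np.median(self.buf)))
```

For example, a single frame in which an occlusion makes the instantaneous count jump from 3 to 9 is absorbed by the median filter, which keeps reporting 3 over the window.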