Copyright © 2006 IEEE. Reprinted from Proceedings IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS'06), 2006

Pedestrian Detection and Tracking for Counting Applications in Crowded Situations

Oliver Sidla, Yuriy Lypetskyy
JOANNEUM RESEARCH, Graz, Austria
os@slr-engineering.at

Norbert Brändle, Stefan Seer
arsenal research, Vienna, Austria
Norbert.Braendle@arsenal.ac.at

Abstract

This paper describes a vision-based pedestrian detection and tracking system which is able to count people in very crowded situations such as escalator entrances in underground stations. The proposed system uses motion to compute regions of interest and to predict movements, extracts shape information from the video frames to detect individuals, and applies texture features to recognize people. A search strategy creates trajectories and new pedestrian hypotheses and then filters and combines those into accurate counting events. We show that counting accuracies of up to 98% can be achieved.

1. Introduction and Scope

We present a system to detect and track pedestrians in very crowded situations for the purpose of counting them. Applications range from railway transport security, pedestrian traffic management, and detection of overcrowding in public buildings to tourist flow estimation. Due to its vast number of applications, vision-based pedestrian detection and tracking is a very active research area in the computer vision community. Much progress has been made in the detection and tracking of individuals in groups, where the algorithms are often tested with small numbers of people in laboratory settings [6], [7]. The individuals' trajectories can be used to count passing people by means of virtual gates or tripwires: users draw straight lines at any location in the field of view, and the algorithm continuously counts how many people cross them (see Figure 1). Liu et al.
[9] apply the human group segmentation algorithm presented in [7] and perform experiments with groups of 5 people. Sacchi et al. [15] present a real-world outdoor counting application and report a mean error of 10%.

Realistic scenarios, however, contain not only loose groups of people but also crowds of individuals like those shown in Figure 1a. For camera views with shallow angles, the mutual occlusions become so severe that no tracking algorithm can handle them effectively, even with a multi-camera approach [8]. This fact is also acknowledged in [9], where even controlled configurations of 5 people are considered “extremely difficult cases” for the segmentation of the groups into individuals.

a) Camera 1 b) Camera 2
Figure 1. Subway platform scenario.

One way to avoid severe occlusions is to use top-view cameras, as in [8] or [11]. In fact, most of today's commercially available video-based people counting solutions are based on such configurations. We consider people counting an added value to security and safety applications and thus want to avoid top-view cameras, which have limited sensing areas and offer perspectives unfamiliar to security personnel.

When dealing with oblique cameras, one way to avoid group segmentation is to estimate the crowd density directly by extracting significant features and feeding them into a classification framework to obtain an estimate of the number of people, as in [12], [13]. The accuracy of such systems strongly depends on the training set and on the choice of the feature set. Lin et al. base their people counting on the recognition of head-like contours using Haar wavelet features and SVM classification [14]. While they provide quantitative results for model worlds with 125 person-like puppets, they do not provide quantitative results on real-world data, due to the lack of ground truth. Our approach for