Copyright © [2006] IEEE. Reprinted from Proceedings IEEE International
Conference on Advanced Video and Signal Based Surveillance (AVSS'06), 2006
Pedestrian Detection and Tracking for Counting Applications in
Crowded Situations
Oliver Sidla, Yuriy Lypetskyy
JOANNEUM RESEARCH,
Graz, Austria
os@slr-engineering.at
Norbert Brändle, Stefan Seer
arsenal research,
Vienna, Austria
Norbert.Braendle@arsenal.ac.at
Abstract
This paper describes a vision-based pedestrian
detection and tracking system which is able to count
people in very crowded situations like escalator entrances
in underground stations. The proposed system uses
motion to compute regions of interest and prediction of
movements, extracts shape information from the video
frames to detect individuals, and applies texture features
to recognize people. A search strategy creates
trajectories and new pedestrian hypotheses and then
filters and combines those into accurate counting events.
We show that counting accuracies of up to 98% can be
achieved.
1. Introduction and Scope
We present a system to detect and track pedestrians in
very crowded situations for the purpose of counting them.
Applications range from railway transport security,
pedestrian traffic management, and detection of overcrowding
in public buildings to tourist flow estimation.
Due to its vast number of applications, vision-based
pedestrian detection and tracking is a very active research
area in the computer vision community. Much progress
has been made in the detection and tracking of individuals
in groups, where the algorithms are often tested with
small numbers of people in laboratory settings [6], [7].
The individuals’ trajectories can be used to count
passing people by means of virtual gates
or tripwires: users draw straight lines at any location
in the field of view, and the algorithm continuously
counts how many people cross them (see Figure 1). Liu
et al. [9] apply the human group segmentation algorithm
presented in [7] and perform experiments with groups of
5 people. Sacchi et al. [15] present a real world outdoor
counting application and report a mean error of 10%.
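The virtual gate idea described above can be illustrated with a short sketch. It is not the implementation of any of the cited systems; it assumes that trajectories are already available as lists of 2D image points, and the names `count_crossings` and `step_crossing`, as well as the rule of counting each trajectory at most once by its net crossing direction, are hypothetical choices made for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _side(a: Point, b: Point, p: Point) -> float:
    """Signed area test: positive if p lies on one side of the directed
    line a->b, negative on the other side, zero if collinear."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def step_crossing(gate_a: Point, gate_b: Point, p: Point, q: Point) -> int:
    """Return +1 or -1 if the step p->q crosses the gate segment
    (sign gives the direction), and 0 if it does not cross."""
    s_p = _side(gate_a, gate_b, p)
    s_q = _side(gate_a, gate_b, q)
    if s_p * s_q >= 0:
        return 0  # both points on the same side, or a point touches the line
    # The step crosses the infinite gate line; check that the crossing
    # lies between the two gate end points.
    if _side(p, q, gate_a) * _side(p, q, gate_b) > 0:
        return 0
    return 1 if s_q > 0 else -1


def count_crossings(gate_a: Point, gate_b: Point,
                    trajectories: List[List[Point]]) -> Tuple[int, int]:
    """Count trajectories by net crossing direction (assumed rule:
    one count per trajectory, jitter across the gate cancels out)."""
    positive = negative = 0
    for traj in trajectories:
        net = sum(step_crossing(gate_a, gate_b, p, q)
                  for p, q in zip(traj, traj[1:]))
        if net > 0:
            positive += 1
        elif net < 0:
            negative += 1
    return positive, negative


# Usage with a vertical gate from (0, 0) to (0, 10) and made-up trajectories.
tracks = [
    [(-2, 5), (-1, 5), (1, 5), (2, 5)],  # crosses the gate in one direction
    [(2, 3), (-2, 3)],                   # crosses in the opposite direction
    [(-2, 20), (2, 20)],                 # passes beyond the gate end point
    [(-1, 5), (1, 5), (-1, 5)],          # steps over and back again
]
print(count_crossings((0, 0), (0, 10), tracks))
```

Counting by net direction is only one possible design choice; it prevents a person who hesitates on the line from being counted several times, at the cost of ignoring people who cross and return.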
Realistic scenarios, however, do not only contain loose
groups of people but rather crowds of individuals like
those shown in Figure 1a. For camera views with shallow
angles, the mutual occlusions become so severe that no
tracking algorithm can handle them effectively, even
with a multi-camera approach [8].
Figure 1. Subway platform scenario: a) Camera 1, b) Camera 2.
This fact is also acknowledged in [9], where even
controlled configurations of 5 people are considered
“extremely difficult cases” for the segmentation of the
groups into individuals.
One way to avoid severe occlusions is to use top-view
cameras, as in [8] or [11]. In fact, most of today’s
commercially available video-based people counting
solutions are based on those configurations. We consider
people counting as an added value to security and safety
applications and thus want to avoid top view cameras
with limited sensing areas and unfamiliar perspectives for
security personnel.
When dealing with oblique cameras, one solution to
avoid group segmentation is to directly estimate the
crowd density by extracting significant features and feeding
them into a classification framework to obtain an
estimate of the number of people, as in [12], [13]. The
accuracy of such systems strongly depends on the training
set and on the choice of the feature set. Lin et al. base
their people counting on the recognition of head-like
contours using Haar wavelet features and SVM
classification [14]. While they provide quantitative
results for model worlds with 125 person-like puppets,
they do not provide quantitative results on real world
data, due to the lack of ground truth. Our approach for