Bayesian Template-based People Detection

Gwenn Englebienne*, Tim van Oosterhout, Ben Kröse*†
*IAS Group, University of Amsterdam, Amsterdam, Netherlands
Email: {G.Englebienne,B.J.A.Krose}@uva.nl
†Hogeschool van Amsterdam, Amsterdam

Abstract—Robust, real-time visual tracking of an arbitrary number of people is a challenging and important problem. Background segmentation methods are widely used, but they discard important information early in the process by enforcing a hard boundary between foreground and background. Template-based methods have been shown to be effective at tracking specific objects such as human faces, but their large number of free parameters can make them slow to apply and hard to optimise globally. In this work, we propose a template-based method for tracking people with fixed cameras, which automatically detects the number of people in a frame, is robust to occlusions, and can run at near-real-time frame rates.

I. INTRODUCTION

Tracking the motion of people in video images is applied to a variety of situations, including athletic performance analysis, content-based video retrieval, surveillance applications, crowd flow analysis and people counting, but also as a preprocessing step to more advanced methods such as gait analysis, behaviour modelling, etc. In typical scenarios, accurate tracking can be extremely challenging. Multiple effects such as varying illumination, occlusions, shadows, specularities and non-static backgrounds all contribute to making the tracking process quite complex. In this paper, we focus on the detection of humans in indoor scenes with fixed cameras. In many applications, most notably surveillance, the computational cost of the detection is critical and one would want the detection process to happen at or above the video frame rate.
We propose a simple but very effective probabilistic method, which allows the automatic evaluation of the number of people in the scene and the detection of those people's locations. This method has the following advantages:

(1) It can incorporate prior knowledge, including which areas of the scene can contain people and how probable it is for people to be in those locations; a probability distribution over the number of people in the scene; a probabilistic model of how close together people tend to walk; etc.

(2) The complexity of the algorithm depends linearly on the number of people in the scene. When many people are present in the frame, detecting all of them requires more than 1/25th of a second with our current implementation of the algorithm, although it still requires far less than a second. Further optimisations could easily improve this performance.

(3) The method is very robust to changes in illumination, shadows and occlusions, and it can easily be made to adapt to non-static backgrounds automatically.

(4) Thanks to its generative probabilistic nature, the model can easily be incorporated into probabilistic models of motion across consecutive frames, such as Kalman filters or particle filters.

II. RELATED WORK

Foreground segmentation is typically done by background subtraction [1], [2], or using a probabilistic model of the background [3], after which a hard decision is made for each pixel whether to consider it foreground or background. The resulting foreground regions are typically noisy, and an extra noise-cleaning step is performed to eliminate foreground regions that are too small or too short-lived [2]. Connected components of foreground pixels can then be found, resulting in foreground "blobs". The main problem with this approach is that a single person will easily give rise to multiple blobs, and parts of multiple people will easily be combined into a single blob.
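The conventional pipeline described above, which our method contrasts against, can be illustrated with a minimal pure-Python sketch. The threshold value, the minimum blob size and the toy frames below are illustrative assumptions, not values from the paper; real systems typically use a learned per-pixel background model rather than a single reference frame.

```python
from collections import deque

def subtract_background(frame, background, threshold=30):
    """Hard per-pixel foreground decision: |frame - background| > threshold."""
    return [[abs(f - b) > threshold for f, b in zip(fr, br)]
            for fr, br in zip(frame, background)]

def connected_components(mask, min_size=2):
    """4-connected blob labelling; blobs smaller than min_size are dropped as noise."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    blobs = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and labels[y][x] == 0:
                queue, pixels = deque([(y, x)]), []
                labels[y][x] = len(blobs) + 1
                while queue:
                    cy, cx = queue.popleft()
                    pixels.append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and labels[ny][nx] == 0:
                            labels[ny][nx] = len(blobs) + 1
                            queue.append((ny, nx))
                if len(pixels) >= min_size:
                    blobs.append(pixels)
    return blobs

# Toy example: two foreground regions plus one isolated noise pixel.
background = [[10] * 6 for _ in range(5)]
frame = [row[:] for row in background]
for y, x in [(0, 0), (0, 1), (2, 2), (3, 4), (4, 4), (4, 5)]:
    frame[y][x] = 200
blobs = connected_components(subtract_background(frame, background))
# The single noise pixel at (2, 2) is discarded; two blobs remain.
```

Note how the hard threshold throws away all per-pixel uncertainty at the very first step: a person whose clothing happens to match the background will fragment into several blobs, and two nearby people merge into one, which is precisely the failure mode motivating the probabilistic approach above.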
Further processing, typically relying on temporal information, is then required to disambiguate the blobs [6]. The blobs are typically used for tracking over multiple frames, and consecutive frames are therefore informative of each other. As a result, most attention has been devoted to techniques that use information from multiple frames to disambiguate the blobs and create accurate tracks from inaccurate observations, including Kalman filters [4], particle filters [5] and graph-based methods [6].

Template-based tracking has been applied successfully to a variety of situations, including the tracking of rigid objects [7] and human body pose estimation [8]. However, in order to allow for sufficient flexibility, such methods either require adapting the templates over time [7] or extra parameters that need to be optimised to fit the template to the observation. In our approach, the template is adapted to a foreground object's position in the image, but not to the particular appearance of the foreground objects. This makes our approach fast and insensitive to local optima that may arise when fitting the template to the observation. It also makes the method more robust to noise in the image.

III. SEGMENTATION

We assume fixed cameras looking straight down from the ceiling. Such a setup was proposed in [9], and has a number of advantages, including reduced numbers of occlusions and, if models are built of the tracked person's appearance, more holistic appearance models. Example images of our setup are depicted in Figure 1. Our purpose is to track the motion of people in a building without imposing any constraints on the