Multi-level Background Initialization using Hidden Markov Models

Marco Cristani, University of Verona, Dipartimento di Informatica, cristanm@sci.univr.it
Manuele Bicego, University of Verona, Dipartimento di Informatica, bicego@sci.univr.it
Vittorio Murino, University of Verona, Dipartimento di Informatica, vittorio.murino@univr.it

ABSTRACT
Most automated video-surveillance applications are based on background modelling, a process aimed at discriminating motion patterns of interest at the pixel, region, or frame level in a nearly static scene. An ordinary background modelling process typically involves three issues: the background model representation, its initialization, and its adaptation. This paper proposes a novel initialization algorithm able to bootstrap an integrated pixel- and region-based background modelling algorithm. The input is an uncontrolled video sequence in which moving objects are present; the output is a pixel- and region-level statistical background model describing the static information of the scene. At the pixel level, multiple hypotheses of the background values are generated by modelling the intensity of each pixel with a Hidden Markov Model (HMM), which also captures the sequentiality of the different color (or gray-level) intensities. At the region level, the resulting HMMs are clustered using a novel similarity measure, which makes it possible to remove moving objects from the sequence and to obtain a segmented image of the observed scene in which each region is characterized by a similar spatio-temporal evolution. Experimental trials on synthetic and real sequences have shown the effectiveness of the proposed approach.
Categories and Subject Descriptors
I.2.10 [Artificial Intelligence]: Vision and Scene Understanding—video analysis; I.5.1 [Pattern Recognition]: Models—statistical; I.5.3 [Pattern Recognition]: Clustering—similarity measures

General Terms
Design, Performance

Keywords
Video Surveillance, pixel-region background initialization, Hidden Markov Model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IWVS'03, November 7, 2003, Berkeley, California, USA. Copyright 2003 ACM 1-58113-780-X/03/00011 ...$5.00.

1. INTRODUCTION
The analysis and understanding of video sequences is an active research field whose importance has increased rapidly in recent years, owing to the availability of increasingly powerful hardware, the development of effective real-time techniques, and the potentially vast range of applications involved [30, 6, 28]. Video surveillance is undoubtedly one of the most interesting applications of sequence analysis: human action recognition [31], semantic indexing of video [21], and, more generally, on-line discovery of unusual activities [12] are all tasks under investigation with the aim of partially or fully automating surveillance.
Typically, a video-surveillance system monitors a site for long periods using a static camera, with the goal of distinguishing (and possibly classifying) unusual behaviors from typical ones. To this end, the basic operation needed is the separation of the moving objects, the so-called foreground (FG), from the static information [7], the background (BG). This process is usually called background modelling.
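As a minimal illustration of the FG/BG separation described above, the following sketch marks as foreground every pixel whose gray level differs from a known background image by more than a fixed threshold. It is not the method proposed in this paper: the function name `foreground_mask`, the toy images, and the threshold value are all assumptions made for the example, and the background image is taken as given, whereas obtaining it is exactly the initialization problem.

```python
def foreground_mask(frame, background, threshold):
    """Return a boolean mask: True where |frame - background| > threshold.

    frame and background are lists of rows of gray-level values
    with identical shape (an assumption of this sketch).
    """
    return [
        [abs(p - b) > threshold for p, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# Hypothetical 3x4 gray-level images: a flat background and a frame
# containing small noise plus a bright two-pixel "moving object".
background = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
frame = [
    [11, 10, 9, 10],
    [10, 200, 210, 10],
    [10, 10, 12, 10],
]

# Threshold chosen by hand for this toy data.
mask = foreground_mask(frame, background, threshold=30)
for row in mask:
    print("".join("#" if is_fg else "." for is_fg in row))
```

The result depends entirely on the quality of `background`: if it is estimated from frames that already contain moving objects, the mask inherits those errors.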
The issues characterizing a background modelling process are usually three: model representation, model initialization, and model adaptation. The first concerns the kind of model (e.g., a mixture of Gaussians) used to represent the background; the second concerns the initialization of this model; and the third concerns the mechanism used to adapt the model to background changes (e.g., illumination changes). Recently, several techniques have been proposed to address the representation and adaptation issues, whereas model initialization has received little attention. In the background model initialization problem, also called bootstrapping [29], the input is a short uncontrolled video sequence in which a number of moving objects may be present. The purpose is then to produce a background model describing the observed scene. In practice, most background models are built on a set of initial parameters derived from a short sequence in which no foreground objects are present [10]. This assumption is too strong, because in some situations it is difficult or impossible to control the area being monitored (e.g., public zones), which is characterized by the continuous presence of moving objects or by other disturbing effects.
In the literature, the initialization problem is typically disregarded, and only a few methods have been proposed. All of these methods discard the solution of computing a simple mean over all the frames, because it produces an image that exhibits blended pixel values in areas where foreground is present. A general analysis regarding the blending rate and how it