Background Subtraction for Temporally Irregular Dynamic Textures

Gerald Dalley, Joshua Migdal, and W. Eric L. Grimson
Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory
77 Massachusetts Ave., Cambridge, MA 02139
{dalleyg,jmigdal,welg}@csail.mit.edu
http://people.csail.mit.edu/dalleyg/

Abstract

In the traditional mixture of Gaussians (MoG) background model, the generating process of each pixel is modeled as a mixture of Gaussians over color. Unfortunately, this model performs poorly when the background consists of dynamic textures such as trees waving in the wind and rippling water. To address this deficiency, researchers have recently looked to more complex and/or less compact representations of the background process. We propose a generalization of the MoG model that handles dynamic textures. In the context of background modeling, we achieve better, more accurate segmentations than the competing methods, using a model whose complexity grows with the underlying complexity of the scene (as any good model should), rather than with the amount of time required to observe all aspects of the texture.

1. Introduction

A typical approach in current scene analysis systems is to build an adaptive statistical model of the background image. When a new frame is presented, pixels that are unlikely to have been generated by this model are labeled as foreground. Stauffer and Grimson [11] represent the background as a mixture of Gaussians (MoG). At each pixel, a collection of Gaussians emits values in RGB (red, green, blue) or some other color space. When a pixel value is observed in a new frame, it is matched to the Gaussian most likely to have emitted it. That Gaussian is then updated with the pixel value using an exponential forgetting scheme that approximates an online k-means algorithm. This allows online adaptation to changing imaging conditions such as shifts in lighting or objects that stop moving. Pixel values are labeled as foreground when they are associated with uncommon Gaussians or when they do not match any Gaussian well. This approach lends itself to real-time implementation and works well when neither the camera nor the “background” moves. However, for most applications, objects such as branches and leaves waving in the wind, and waves in water, should be considered background even though they involve motion. Because these dynamic textures cause large changes at the individual pixel level, they are typically modeled poorly by a fully independent per-pixel model. In the middle column of Fig. 5, we see how the MoG foreground mask not only (correctly) includes both the pedestrians and the vehicle, but also includes many other pixels due to image noise and moving trees.
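To make this per-pixel baseline concrete, the sketch below shows one plausible single-pixel implementation of a Stauffer-Grimson style MoG update in Python/NumPy. The isotropic covariances, the single fixed learning rate used for both the weights and the Gaussian parameters, and the specific constants (a 2.5-sigma match threshold, a 0.7 background weight fraction, the initial variance) are illustrative assumptions for this sketch, not values taken from [11].

```python
import numpy as np

class PixelMoG:
    """Minimal sketch of a Stauffer-Grimson style per-pixel mixture of
    Gaussians.  Isotropic covariances, a fixed learning rate, and the
    default constants below are simplifying assumptions."""

    def __init__(self, n_components=3, alpha=0.01, match_sigmas=2.5,
                 bg_weight_frac=0.7, init_var=900.0):
        self.alpha = alpha                    # exponential forgetting rate
        self.match_sigmas = match_sigmas      # match threshold in std. devs.
        self.bg_weight_frac = bg_weight_frac  # weight mass labeled background
        self.init_var = init_var              # variance of newly spawned modes
        self.w = np.full(n_components, 1.0 / n_components)   # mixing weights
        self.mu = np.zeros((n_components, 3))                 # RGB means
        self.var = np.full(n_components, init_var)            # isotropic variances

    def update(self, x):
        """Fold one RGB observation into the model; return True if it is foreground."""
        x = np.asarray(x, dtype=float)
        d2 = np.sum((self.mu - x) ** 2, axis=1)              # squared color distances
        matched = d2 < (self.match_sigmas ** 2) * self.var   # within ~2.5 sigma?

        if not matched.any():
            # No Gaussian explains x: replace the weakest mode with a new,
            # low-weight, high-variance one centered on x, and flag foreground.
            k = int(np.argmin(self.w / np.sqrt(self.var)))
            self.w[k], self.mu[k], self.var[k] = self.alpha, x, self.init_var
            self.w /= self.w.sum()
            return True

        # Associate x with the closest matching Gaussian (Mahalanobis sense).
        k = int(np.argmin(np.where(matched, d2 / self.var, np.inf)))

        # Exponential forgetting, approximating an online k-means update.
        hit = np.zeros_like(self.w)
        hit[k] = 1.0
        self.w = (1 - self.alpha) * self.w + self.alpha * hit
        self.mu[k] += self.alpha * (x - self.mu[k])
        self.var[k] += self.alpha * (d2[k] - self.var[k])

        # Components carrying most of the weight (relative to their spread)
        # are treated as background; anything else is foreground.
        order = np.argsort(-self.w / np.sqrt(self.var))
        n_bg = 1 + np.searchsorted(np.cumsum(self.w[order]), self.bg_weight_frac)
        return k not in order[:n_bg]
```

In a full system, one such model (or a vectorized equivalent) is maintained independently at every pixel, and the returned flags form the foreground mask. It is precisely this per-pixel independence that lets waving branches and rippling water repeatedly fall outside the dominant modes and pollute the mask, as seen in Fig. 5.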
More recently, Mittal and Paragios [5] used the most recent T frames to build a non-parametric model of color and optical flow, taking care to account for measurement uncertainty when estimating kernel density bandwidths. Uncertainty management is especially important here because of the inherent ambiguities in local optical flow estimation. While their approach still models the image as a collection of independent pixels, they produce impressive results when the same motions are observed many times in every block of T frames. Challenges are likely to arise when motions occur only infrequently, such as trees that rustle periodically (but not constantly) in wind gusts. The improved classification performance also comes at a computational cost linear in T: for a 200-frame window, their highly optimized implementation is one to two orders of magnitude slower than typical MoG implementations.

Sheikh and Shah [10] have also developed a kernel-based model of the background using the most recent T frames. Their kernels are Gaussians over pixel color and location. By allowing observed pixels to match kernels centered at neighboring pixel locations, they are able to interpret small spatial motions, such as trees waving in the wind, as being part of the background. Like Mittal and Paragios, they must maintain a long enough kernel history to represent all modes in the local background distribution. Fortunately, for many types of scenes, this history length will be shorter for Sheikh and Shah, since information can be “shared” by kernels spawned by nearby pixels. We will show that our approach is able to achieve similar sharing benefits, and we do so by including a small set of easily