Unsupervised Dynamic Texture Segmentation Using Local Spatiotemporal Descriptors

Jie Chen, Guoying Zhao and Matti Pietikäinen
Machine Vision Group, Infotech Oulu and Department of Electrical and Information Engineering,
P. O. Box 4500, FI-90014 University of Oulu, Finland
E-mail: {jiechen, gyzhao, mkp}@ee.oulu.fi

Abstract

Dynamic texture (DT) is an extension of texture to the temporal domain. In this paper, we address the problem of segmenting a DT into disjoint regions in an unsupervised way. Each region is characterized by histograms of local binary patterns and contrast computed in a spatiotemporal mode, which combines the motion and appearance of the DT. Experimental results show that our method is effective in segmenting regions that differ in their dynamics.

1. Introduction

Dynamic textures (DTs), or temporal textures, are textures with motion [3, 6, 12]. DTs abound in the real world: sea waves, smoke, foliage, fire, showers, whirlwinds, etc. Potential applications of DT analysis include remote monitoring and various types of surveillance in challenging environments, such as monitoring forest fires to prevent natural disasters, traffic monitoring, homeland security applications, and observing animal behavior for scientific studies [2].

Segmentation is one of the classical problems in computer vision [1, 8, 10]. The segmentation of DTs is particularly challenging compared with the static case because of their unknown spatiotemporal extent. Existing approaches to DT segmentation can be categorized into supervised and unsupervised methods. Supervised segmentation requires a priori information about the textures present; unsupervised segmentation does not, which makes it a very challenging research problem. Moreover, most recent methods need an initialization.
Examples of recent approaches are methods based on mixtures of dynamic textures [2], mixtures of linear models [4], multi-phase level sets [5], Gauss-Markov models and level sets [6], Ising descriptors [7], and optical flow [13].

A key problem of DT segmentation is how to combine motion and appearance features. We note that a recently proposed feature, local binary patterns on three orthogonal planes (LBP-TOP), has a promising ability to describe both the appearance and the motion of a DT [14]. It also appears to be robust to monotonic gray-scale changes caused, e.g., by illumination variations. In addition, Ojala and Pietikäinen used the local binary pattern and contrast for unsupervised static texture segmentation and obtained good performance [9].

Our proposed approach is based on the work of [14] and [9]. In this paper, we generalize the frequently cited method of Ojala to DTs. Motivated by [14], we also generalize the contrast of a single spatial texture to a spatiotemporal mode (we call it C_TOP, i.e., contrast on three orthogonal planes). Combining LBP-TOP and C_TOP, we call the generalized method (LBP/C)_TOP. It is a theoretically and computationally simple approach to modeling DTs. We then use (LBP/C)_TOP histograms for DT segmentation. The (LBP/C)_TOP features extracted in a small local neighborhood reflect the spatiotemporal properties of the DT. To the best of our knowledge, LBP methods have not been used earlier for DT segmentation.

The rest of this paper is organized as follows: In Section 2, we describe the generalized feature (LBP/C)_TOP and how to use it for DT segmentation. In Section 3, we show the detailed process of segmentation. In Section 4, some experimental results are presented, followed by discussion in Section 5.

2. Features for segmentation

In this section, after a brief review of LBP-TOP and intensity contrast, we describe how the generalized (LBP/C)_TOP is used for DT segmentation.
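As a concrete illustration of the LBP/C idea of [9] that we build on, the following sketch accumulates a joint 2-D distribution of LBP codes and quantized contrast values. The function name, the number of contrast bins, and the equal-width quantization are our assumptions for illustration; the paper does not prescribe them.

```python
import numpy as np

def joint_lbp_c_histogram(codes, contrasts, c_bins=8, c_max=None):
    """Joint 2-D distribution of 8-bit LBP codes and quantized contrast,
    in the spirit of the LBP/C descriptor of Ojala and Pietikainen [9].
    Hypothetical helper: bin count and quantization scheme are assumptions."""
    codes = np.asarray(codes)
    contrasts = np.asarray(contrasts, dtype=float)
    if c_max is None:
        c_max = contrasts.max() or 1.0  # avoid division by zero
    # Quantize contrast into c_bins equal-width bins over [0, c_max].
    q = np.minimum((contrasts / c_max * c_bins).astype(int), c_bins - 1)
    hist = np.zeros((256, c_bins))
    for code, b in zip(codes, q):
        hist[code, b] += 1
    return hist / hist.sum()  # normalized joint distribution
```

In the spatiotemporal generalization sketched in this paper, such a joint distribution would be collected separately from each of the three orthogonal planes.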
2.1 LBP-TOP/Contrast

LBP-TOP is a spatiotemporal descriptor [14]. As shown in Fig. 1, (a) is a sequence of frames (images) of a DT; (b) denotes the three orthogonal planes, or slices, XY, XT and YT, where XY is the appearance (a frame) of the DT, XT shows the visual impression of a row changing in time, and YT describes the motion of a column in temporal space; (c) shows how LBP and contrast are computed for each pixel of these three planes: a binary code is produced by thresholding the square neighborhood of the pixel in the XY, XT and YT slices independently with the value of the center pixel; (d) shows how histograms are computed by collecting the occurrences of the different binary patterns from the three slices, denoted H_{λ,π} (λ = LBP and π = XY, XT, YT). We encode a DT by LBP using these three sub-histograms to consider simultaneously the appearance and the motion in two directions, i.e., incorporating the spatial-domain information and two spatiotemporal co-occurrence statistics. Concatenating the three sub-histograms H_{λ,π} (λ = LBP, π = XY, XT, YT) into a single histogram yields the LBP-TOP feature histogram.

The contrast measure C is the difference between the average gray level of the neighborhood pixels that have value 1 and those that have value 0 (Fig. 1 (c)). Likewise, we also compute the contrast in the three orthogonal planes,
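The per-pixel step and the three-plane histogram collection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are ours, an 8-neighbour 3x3 square neighborhood is assumed, and for brevity only one central slice per orientation is sampled, whereas LBP-TOP accumulates statistics over the whole volume.

```python
import numpy as np

def lbp_and_contrast(patch):
    """LBP code and contrast C for the center pixel of a 3x3 patch.
    Neighbours are thresholded by the center value; C is the mean gray
    level of neighbours assigned bit 1 minus the mean of those assigned
    bit 0, as in Fig. 1 (c)."""
    center = patch[1, 1]
    # 8 neighbours in a fixed circular order
    nb = np.array([patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                   patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]],
                  dtype=float)
    bits = (nb >= center).astype(int)
    code = int(sum(b << i for i, b in enumerate(bits)))
    ones, zeros = nb[bits == 1], nb[bits == 0]
    c = (ones.mean() if ones.size else 0.0) - (zeros.mean() if zeros.size else 0.0)
    return code, c

def lbp_top_histograms(volume):
    """Concatenated LBP histograms from XY, XT and YT slices of a
    T x H x W gray-level volume (simplified: one central slice each)."""
    T, H, W = volume.shape
    hists = []
    for plane in (volume[T // 2],          # XY: one frame
                  volume[:, H // 2, :],    # XT: one row over time
                  volume[:, :, W // 2]):   # YT: one column over time
        h = np.zeros(256)
        for i in range(1, plane.shape[0] - 1):
            for j in range(1, plane.shape[1] - 1):
                code, _ = lbp_and_contrast(plane[i - 1:i + 2, j - 1:j + 2])
                h[code] += 1
        hists.append(h / max(h.sum(), 1))  # normalize each sub-histogram
    return np.concatenate(hists)           # LBP-TOP feature histogram
```

Concatenating the three normalized sub-histograms gives a 768-bin feature vector; the contrast values returned alongside each code would feed the C_TOP part of the descriptor.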