Multi-Modal Face Tracking Using Bayesian Network

Fang Liu 1, Xueyin Lin 1, Stan Z Li 2, Yuanchun Shi 1
1 Dept. of Computer Science, Tsinghua University, Beijing, China, 100084
2 Microsoft Research Asia, Beijing, China, 100080
liufang@tsinghua.org.cn, lxy-dcs@mail.tsinghua.edu.cn, szli@microsoft.com

Abstract

This paper presents a Bayesian network based multi-modal fusion method for robust and real-time face tracking. The Bayesian network integrates a prior of second order system dynamics and the likelihood cues from color, edge and face appearance. Since different modalities have different confidence scales, we encode the environmental factors related to the confidences of the modalities into the Bayesian network, and develop a Fisher discriminant analysis method for learning optimal fusion. The face tracker can track multiple faces under different poses. It consists of two stages. First, hypotheses are efficiently generated using a coarse-to-fine strategy; then multiple modalities are integrated in the Bayesian network to evaluate the posterior of each hypothesis. The hypothesis that maximizes the posterior (MAP) is selected as the estimate of the object state. Experimental results demonstrate the robustness and real-time performance of our face tracking approach.

1. Introduction

Face tracking is important for many vision-based applications such as human-computer interaction. Face trackers can be classified into two classes: general tracking methods and learning based methods.

General tracking methods use low level features such as color and contour to track objects, including faces [1,2,3,4,5,6,7,8,9,10]. For example, background models are often built and updated to segment the foreground regions [1,2,3]. Monte Carlo methods [4,5,6,7] adopt sampling techniques to model the posterior probability distribution of the object state and track objects through inference in a dynamical Bayesian network.
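The MAP selection over hypotheses described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the one-dimensional state, the Gaussian prior, and the two toy cue likelihoods are stand-ins for the paper's dynamics prior and color/edge/appearance models.

```python
import math

def map_estimate(hypotheses, prior, likelihoods):
    """Return the hypothesis maximizing the unnormalized posterior,
    i.e. the product of the dynamics prior and all cue likelihoods.
    Log space avoids numerical underflow when many cues are multiplied."""
    best, best_logp = None, float("-inf")
    for x in hypotheses:
        logp = math.log(prior(x))
        for lik in likelihoods:
            logp += math.log(lik(x))
        if logp > best_logp:
            best, best_logp = x, logp
    return best

# Toy one-dimensional example: a prior centered at 1.0 and two cue
# likelihoods (stand-ins for, e.g., color and edge models).
hypotheses = [0.0, 1.0, 2.0]
prior = lambda x: math.exp(-(x - 1.0) ** 2)
cues = [lambda x: math.exp(-(x - 1.2) ** 2),
        lambda x: math.exp(-(x - 0.8) ** 2)]
```

Here `map_estimate(hypotheses, prior, cues)` returns 1.0, the hypothesis on which the prior and both cues agree.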
A robust non-parametric technique, the mean shift algorithm, has also been proposed for visual tracking [8,9,10]. In [8], human faces are tracked by projecting the face color distribution model onto the color frame and moving the search window to the mode (peak) of the probability distribution by climbing density gradients. In [9,10], non-rigid objects are tracked by finding the most probable target position, minimizing a metric based on the Bhattacharyya coefficient between the target model and the target candidates. Other methods have been presented to track human heads, for example, tracking contours through inference in a JPDAF-based HMM [11], an algorithm combining the intensity gradient and the color histogram [12], and motion-based tracking with adaptive appearance models [13].

Learning based methods track faces using learning approaches [14,15,16,17,18]. Results from face detection can help face tracking. In face detection, the goal is to learn, from training face and non-face examples (sub-windows), a highly nonlinear classifier that differentiates face patterns from non-face patterns. The learning based approach has so far been the most effective for constructing face/non-face classifiers. Taking advantage of the fact that faces are highly correlated, it is assumed that human faces can be described by some low dimensional features derived from a set of prototype face images. The system of Viola and Jones [14] is a successful application of AdaBoost to face detection. Li et al. [15] extend Viola and Jones' work to multi-view faces with an improved boosting algorithm.

A face detection based algorithm can be less sensitive to illumination changes and color distracters than the general tracking methods, because it relies on the face pattern rather than on color and contour only. However, such algorithms have their own difficulties. First, a face detector may miss some faces and produce false alarms.
Second, partially occluded faces and rotated faces are more likely to be missed. Third, multiple-pose face detection is several times more costly than frontal face detection.

This paper presents a Bayesian face tracker that tracks the position, scale and pose of multiple faces. The face tracker unifies low level features, such as color and contour, with high level features, such as face appearance, for robust and real-time tracking of multiple faces. As shown in Figure 1, the Bayesian network [19,20] includes four components: (1) the prior model, a second order system dynamics, and the likelihood models for (2) color, (3) edge and (4) face appearance.

The presented method differs from previous work, for example Monte Carlo methods [4,5,6,7], in the following ways. First, in the stage of hypothesis generation, our tracker significantly reduces the number of hypotheses needed for robust tracking. In Monte Carlo methods, factored sampling and importance sampling techniques are employed to predict the distributions of the object states.
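The two-stage scheme above can be made concrete with a small sketch. The function names and step sizes are illustrative, not taken from the paper, and for brevity the fine grid is placed only around the predicted center rather than around the best coarse hypotheses:

```python
def predict_second_order(x_prev, x_prev2):
    """Constant-velocity prediction from the two previous states,
    the simplest instance of a second-order dynamics prior:
    x_t = x_{t-1} + (x_{t-1} - x_{t-2})."""
    return tuple(2 * a - b for a, b in zip(x_prev, x_prev2))

def coarse_to_fine_hypotheses(center, radius=16, coarse_step=8, fine_step=2):
    """Candidate face positions around the predicted center: a coarse
    grid covering the whole search region, plus a fine grid near the
    center. This yields far fewer candidates than blanket sampling
    of the search region at the fine resolution."""
    cx, cy = center
    coarse = [(cx + dx, cy + dy)
              for dx in range(-radius, radius + 1, coarse_step)
              for dy in range(-radius, radius + 1, coarse_step)]
    fine = [(cx + dx, cy + dy)
            for dx in range(-coarse_step, coarse_step + 1, fine_step)
            for dy in range(-coarse_step, coarse_step + 1, fine_step)]
    return coarse + fine
```

Each candidate would then be scored by the Bayesian network, and the MAP candidate kept as the new face state.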