Boosted Audio-Visual HMM for Speech Reading

Pei Yin, Irfan Essa, James M. Rehg
Georgia Institute of Technology
GVU Center / College of Computing
Atlanta, GA 30332-0280 USA
{pyin, irfan, rehg}@cc.gatech.edu

Abstract

We propose a new approach to combining acoustic and visual measurements to aid in recognizing the lip shapes of a person speaking. Our method relies on computing the maximum likelihoods of (a) an HMM used to model phonemes from the acoustic signal, and (b) an HMM used to model the motion of visual features from video. One significant addition in this work is dynamic analysis with features selected by AdaBoost on the basis of their discriminative ability. This form of integration, leading to a boosted HMM, lets AdaBoost find the best features first and then uses the HMM to exploit the dynamic information inherent in the signal.

1. Introduction and Related Work

Speech reading has been the subject of much research owing to its obvious benefits for machine-assisted language interpretation [14]. The primary premise of speech reading is to combine audio and visual information to assist in interpreting what a person is saying. The success of hidden Markov models (HMMs) and related techniques in acoustic speech recognition [10] has led to their direct application to computer vision, specifically to temporally varying streams such as the visual features of the face that move with the lip motions associated with speech. The audio domain has been thoroughly investigated and is equipped with well-founded features such as MFCCs; speech reading, in contrast, mostly relies on ad-hoc, manually defined features. The performance of such speech-reading systems therefore hinges on visual features that are not robust enough. One approach to obtaining good visual features is to select them automatically according to their discriminative ability.
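The audio-visual coupling described in the abstract, choosing the phoneme that maximizes the combined likelihood under the audio HMM and the visual HMM, can be sketched as follows. The log-likelihood values and the stream weight `lam` are illustrative assumptions, not figures from the paper:

```python
# Hypothetical per-phoneme log-likelihoods that separately trained
# audio-only and visual-only HMMs might assign to one utterance.
audio_loglik = {"/p/": -12.0, "/b/": -10.5, "/m/": -15.0}
visual_loglik = {"/p/": -8.0, "/b/": -9.5, "/m/": -7.0}

def fuse_ml(audio_ll, visual_ll, lam=0.7):
    """Pick the phoneme maximizing a weighted sum of the two streams'
    log-likelihoods; lam is an assumed stream weight."""
    return max(audio_ll, key=lambda p: lam * audio_ll[p] + (1 - lam) * visual_ll[p])

print(fuse_ml(audio_loglik, visual_loglik))  # -> /b/
```

With these toy scores the visual stream pulls the decision away from the audio-only winner only when its evidence is strong enough relative to the weight `lam`, which is the usual behavior of likelihood-level stream fusion.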
Recent work in face detection [15] has proposed a feature selection method based on AdaBoost. AdaBoost has also been used with HMMs in the acoustic domain for speech recognition [12, 6] and with dynamic Bayesian networks in an audio-visual approach to speaker detection [1].

Figure 1: Comparison with previous work on boosting applied to speech recognition. Left: Schwenk [12] (AdaBoost over neural networks). Middle: Meyer [6] (AdaBoost over HMMs). Right: our system for speech reading (HMM over AdaBoost-selected features).

In this paper we develop a boosted visual feature selection method for HMM-based speech reading. We couple the audio and visual information by computing the maximum likelihoods (ML) of phonemes under the audio HMM and the visual HMM, and we propose frame-level feature selection by AdaBoost for the visual HMM. As shown in Figure 1, previous work on integrating AdaBoost and HMMs either relies entirely on AdaBoost for phoneme classification, using the HMM only to form the higher-level acoustic model [12], or simply constructs an HMM-AdaBoost ensemble by linearly combining HMMs [6]. These approaches produce good results; however, the first fails to use the HMM to model inter-frame dynamics, while the second fails to use AdaBoost to select features. We claim that the better way to integrate the two methods is the boosted HMM, which uses AdaBoost to find the best features first and then uses the HMM to exploit the dynamic information inherent in the signal.

The novel aspect of our proposed approach is that the HMM is fed the most informative features over the hardest samples, as selected by AdaBoost. The role of AdaBoost in our design is not only to adaptively concentrate on the hardest samples, but also to estimate the distribution of informa-

Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG'03) 0-7695-2010-3/03 $17.00 © 2003 IEEE
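The frame-level feature selection role that AdaBoost plays here, repeatedly fitting weak learners and reweighting toward the hardest samples so that the most discriminative feature dimensions emerge, can be sketched with threshold stumps. The toy data, the number of rounds, and the stump learner are all assumptions for illustration, not the paper's actual setup:

```python
import numpy as np

# Toy stand-in for per-frame visual feature vectors (not the paper's data):
# feature 0 separates the two classes; feature 1 is pure noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal((3.0, 0.0), 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

def boost_select(X, y, rounds=5):
    """Minimal AdaBoost with decision stumps; returns the feature index
    chosen in each round, i.e. the most discriminative features."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # sample weights, uniform at the start
    chosen = []
    for _ in range(rounds):
        best = None  # (weighted error, feature, threshold, polarity)
        for f in range(d):
            for t in np.unique(X[:, f]):
                for s in (1, -1):
                    pred = s * np.where(X[:, f] > t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, t, s)
        err, f, t, s = best
        chosen.append(f)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        pred = s * np.where(X[:, f] > t, 1, -1)
        w *= np.exp(-alpha * y * pred)  # boost the weight of hard samples
        w /= w.sum()
    return chosen

print(boost_select(X, y))
```

Because the stumps almost always pick the discriminative dimension, the indices they return act as the selected feature set; in the paper's design those selected visual features are what the visual HMM is then trained on.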