Boosted Audio-Visual HMM for Speech Reading
Pei Yin, Irfan Essa, James M. Rehg
Georgia Institute of Technology
GVU Center / College of Computing
Atlanta, GA 30332-0280 USA
{pyin, irfan, rehg}@cc.gatech.edu
Abstract
We propose a new approach for combining acoustic and
visual measurements to aid in recognizing lip shapes of a
person speaking. Our method relies on computing the max-
imum likelihoods of (a) an HMM used to model phonemes
from the acoustic signal, and (b) an HMM used to model vi-
sual feature motion from video. One significant addition in
this work is dynamic analysis with features selected by Ad-
aBoost on the basis of their discriminative ability. This form
of integration, leading to a boosted HMM, permits AdaBoost
to find the best features first, and then uses the HMM to ex-
ploit the dynamic information inherent in the signal.
1. Introduction and Related Work
Speech reading has been the subject of much research due to
its obvious benefits in assisting language interpretation by
machines [14]. The primary premise of speech reading is to
combine audio and visual information to assist in interpret-
ing what a person is saying. The success of hidden Markov
models (HMMs) and related techniques for acoustic speech
recognition [10] has led to their direct application in com-
puter vision domains, specifically when dealing with tem-
porally varying streams. These include visual features of
the face that move with the lip motions associated with speech.
The audio domain has been thoroughly investigated and is
equipped with well-founded features such as MFCCs; in
contrast, speech reading mostly relies on ad-hoc, manually
defined features. The performance of speech reading there-
fore depends on visual features that are not sufficiently
robust.
One approach to obtain good visual features is to auto-
matically select them according to their discrimination abil-
ity. Recent work in face detection [15] has proposed a fea-
ture selection method based on AdaBoost. AdaBoost has
also been used with HMM in the acoustic domain for speech
recognition [12, 6] and with dynamic Bayesian networks in
an audio-visual approach for speaker detection [1].
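To make the idea of selecting features by discrimination ability concrete, the following toy AdaBoost loop with single-feature threshold stumps (an illustrative sketch, not the implementation used in the cited work) picks, at each round, the feature whose best stump has the lowest weighted error; the reweighting step is what concentrates later rounds on the hardest samples:

```python
import numpy as np

def select_features_adaboost(X, y, n_rounds=3):
    """Toy AdaBoost feature selection with one-feature threshold stumps.

    Each round, the feature whose best stump has the lowest weighted
    error is chosen, so the returned indices rank features by their
    discriminative ability.  X: (n_samples, n_features); y in {-1, +1}.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # sample weights, updated each round
    chosen = []
    for _ in range(n_rounds):
        best = None                     # (error, feature, threshold, polarity)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        eps = max(err, 1e-10)           # avoid log(0) for a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
        # Re-weight: misclassified ("hard") samples gain weight.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        chosen.append(j)
    return chosen
```

On toy data where only one feature separates the classes, that feature is selected in the first round.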
Figure 1: Comparison with previous work on boosting with
application to speech recognition. Left: Schwenk [12].
Middle: Meyer [6]. Right: Our system for speech reading.
In this paper we develop a boosted visual feature selec-
tion method for HMM in speech reading. We employ audio-
visual information coupling by computing maximum like-
lihoods (ML) of phonemes over the audio HMM and the
visual HMM, and propose frame-level feature selection by
AdaBoost for the visual HMM. As shown in Figure 1, the
previous work aimed at integrating AdaBoost and HMMs
either relies entirely on AdaBoost for phoneme classifica-
tion, using the HMM only to form the higher-level acoustic
model [12], or simply constructs an HMM-AdaBoost en-
semble by linearly combining the HMMs [6]. These ap-
proaches generate some good results; however, the first
fails to use the HMM to model inter-frame dynamics, while
the second fails to use AdaBoost to select features. We
claim that the better way to integrate these two methods is
the boosted HMM, which uses AdaBoost to find the best
features first, and then uses the HMM to exploit the dy-
namic information inherent in the signal.
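The ML decision over per-phoneme audio and visual HMMs can be sketched as follows (a minimal illustration under assumed choices, not the paper's implementation: unit-covariance Gaussian emissions and a stream-weighting factor `lam` are hypothetical): each candidate phoneme is scored under both streams with the forward algorithm, and the phoneme with the largest weighted sum of log-likelihoods wins.

```python
import numpy as np

def gaussian_log_emissions(obs, means):
    """(T, S) array of log N(o_t; mu_s, I) for unit-covariance states."""
    d = obs.shape[1]
    diff = obs[:, None, :] - means[None, :, :]            # (T, S, d)
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(diff ** 2, axis=2))

def log_forward(log_A, log_pi, log_B_obs):
    """Forward-algorithm log-likelihood of one observation stream."""
    alpha = log_pi + log_B_obs[0]
    for t in range(1, len(log_B_obs)):
        alpha = log_B_obs[t] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def classify(audio_hmms, visual_hmms, audio_obs, visual_obs, lam=0.7):
    """Argmax over phonemes of the lam-weighted audio+visual log-likelihood.

    Each model is a (log_A, log_pi, state_means) triple; one audio and
    one visual HMM per phoneme, as in the coupling described above.
    """
    best, best_score = None, -np.inf
    for ph in audio_hmms:
        la = log_forward(*audio_hmms[ph][:2],
                         gaussian_log_emissions(audio_obs, audio_hmms[ph][2]))
        lv = log_forward(*visual_hmms[ph][:2],
                         gaussian_log_emissions(visual_obs, visual_hmms[ph][2]))
        score = lam * la + (1 - lam) * lv
        if score > best_score:
            best, best_score = ph, score
    return best
```

With two single-state toy phonemes whose means differ, observations near one mean are classified accordingly.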
The novel aspect of our proposed approach is that the HMM
is fed with the most informative features over the hardest
samples, as selected by AdaBoost. The role of AdaBoost in
our design is not only to adaptively concentrate on the hard-
est samples, but also to estimate the distribution of informa-
Proceedings of the IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG’03)
0-7695-2010-3/03 $ 17.00 © 2003 IEEE