J Sign Process Syst (2012) 68:83–93
DOI 10.1007/s11265-011-0578-x
Adaptive Reliability Measure and Optimum Integration
Weight for Decision Fusion Audio-visual Speech
Recognition
R. Rajavel · P. S. Sathidevi
Received: 9 October 2009 / Revised: 12 January 2011 / Accepted: 14 January 2011 / Published online: 2 February 2011
© Springer Science+Business Media, LLC 2011
Abstract Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. An important issue in decision-fusion-based AVSR systems is the determination of an appropriate integration weight for the speech modalities, so that the two are integrated to ensure better performance under various SNR conditions. Generally, the integration weight is calculated from the relative reliability of the two modalities. This paper investigates the effect of the reliability measure on integration weight estimation and proposes a genetic algorithm (GA) based reliability measure which uses an optimum number of best recognition hypotheses, rather than the N best recognition hypotheses, to determine an appropriate integration weight. Further improvement in recognition accuracy is achieved by optimizing this integration weight with the genetic algorithm. The performance of the proposed integration weight estimation scheme is demonstrated for isolated word recognition (covering functions commonly used in mobile phones) via a multi-speaker database experiment. The results show that the proposed schemes improve robust recognition accuracy over the conventional unimodal systems and two related existing bimodal systems, namely the baseline reliability-ratio-based system and the N-best recognition hypotheses reliability-ratio-based system, under various SNR conditions.
R. Rajavel · P. S. Sathidevi
ECE Department, National Institute of Technology Calicut,
Calicut 673601, India
e-mail: rettyraja@gmail.com
P. S. Sathidevi
e-mail: sathi@nitc.ac.in
Keywords Audio-visual speech recognition ·
Side face visual feature extraction · Audio-visual
decision fusion · Reliability-ratio based weight
optimization · GA based reliability measure
1 Introduction
Human speech perception is bimodal in nature: humans combine audio and visual information in deciding what others speak. The first AVSR system was reported in 1984 by Petajan [18]. During the last decade, more than a hundred articles have appeared on AVSR [5, 6, 8, 9, 13, 17, 23, 25]. AVSR systems can enhance the performance of conventional ASR not only under noisy conditions but also in clean conditions when the talking face is visible [20, 26]. The major advantage of utilizing the acoustic and the visual modalities for speech understanding comes from the "complementarity" [21] of the two modalities and from "synergy": the performance of audio-visual speech perception can exceed that of acoustic-only and visual-only perception under diverse noise conditions [22]. Generally, in AVSR systems, integration can take place either before the two information sources are processed by a recognizer (early integration/feature fusion) or after they are classified independently (late integration/decision fusion). Some studies favor early integration [1, 6, 7, 13], while others prefer late integration [2–5, 19, 24]. Despite all these studies, which underline the fact that speech reading is part of speech recognition in humans, it is still not well understood when and how the acoustic and visual information are integrated. This paper exploits the practical implementation advantages of late integration to construct a robust AVSR system.
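To make the decision-fusion idea concrete, the following is a minimal illustrative sketch (not the paper's exact formulation): each modality's recognizer produces per-word log-likelihoods, a simple reliability measure is taken as the average margin between the best hypothesis and the remaining N-best hypotheses, and the integration weight is set from the reliability ratio. The function names, the toy scores, and the specific margin-based reliability are assumptions for illustration only.

```python
def reliability(scores):
    # Reliability of one modality: average difference between the best
    # N-best log-likelihood and each of the remaining hypotheses.
    # A large margin suggests the recognizer is confident (reliable).
    s = sorted(scores, reverse=True)
    return sum(s[0] - x for x in s[1:]) / (len(s) - 1)

def fuse(audio_scores, visual_scores):
    # Reliability-ratio integration weight: lam in [0, 1] leans toward
    # the more reliable modality.
    ra = reliability(list(audio_scores.values()))
    rv = reliability(list(visual_scores.values()))
    lam = ra / (ra + rv)
    # Weighted log-likelihood combination per candidate word:
    #   score(w) = lam * logP_audio(w) + (1 - lam) * logP_visual(w)
    return {w: lam * audio_scores[w] + (1 - lam) * visual_scores[w]
            for w in audio_scores}

# Toy example: hypothetical per-word log-likelihoods for three words.
audio = {"call": -10.0, "text": -14.0, "home": -16.0}
visual = {"call": -12.0, "text": -11.0, "home": -15.0}
fused = fuse(audio, visual)
best_word = max(fused, key=fused.get)
```

Here the audio recognizer has the larger N-best margin, so the weight favors it and the fused decision follows the audio-best word; under heavy acoustic noise the margins, and hence the weight, would shift toward the visual modality.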