J Sign Process Syst (2012) 68:83–93
DOI 10.1007/s11265-011-0578-x

Adaptive Reliability Measure and Optimum Integration Weight for Decision Fusion Audio-visual Speech Recognition

R. Rajavel · P. S. Sathidevi

Received: 9 October 2009 / Revised: 12 January 2011 / Accepted: 14 January 2011 / Published online: 2 February 2011
© Springer Science+Business Media, LLC 2011

Abstract  Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. An important issue in a decision-fusion-based AVSR system is the determination of an appropriate integration weight with which the speech modalities are combined, so as to ensure good performance under various SNR conditions. Generally, the integration weight is calculated from the relative reliability of the two modalities. This paper investigates the effect of the reliability measure on integration weight estimation and proposes a genetic algorithm (GA) based reliability measure that uses the optimum number of best recognition hypotheses, rather than the N best recognition hypotheses, to determine an appropriate integration weight. Further improvement in recognition accuracy is achieved by optimizing the measured integration weight with a genetic algorithm. The performance of the proposed integration weight estimation scheme is demonstrated for isolated word recognition (covering functions commonly used in mobile phones) through a multi-speaker database experiment. The results show that the proposed schemes improve robust recognition accuracy over conventional unimodal systems and over two related existing bimodal systems, namely the baseline reliability-ratio-based system and the N-best-recognition-hypotheses reliability-ratio-based system, under various SNR conditions.

R. Rajavel (B) · P. S. Sathidevi
ECE Department, National Institute of Technology Calicut, Calicut 673601, India
R. Rajavel e-mail: rettyraja@gmail.com
P. S. Sathidevi e-mail: sathi@nitc.ac.in

Keywords  Audio-visual speech recognition · Side face visual feature extraction · Audio-visual decision fusion · Reliability-ratio based weight optimization · GA based reliability measure

1 Introduction

Human speech perception is bimodal in nature: humans combine audio and visual information when deciding what others say. The first AVSR system was reported in 1984 by Petajan [18]. During the last decade, more than a hundred articles have appeared on AVSR [5, 6, 8, 9, 13, 17, 23, 25]. AVSR systems can enhance the performance of conventional ASR not only under noisy conditions but also in clean conditions, when the talking face is visible [20, 26]. The major advantage of utilizing the acoustic and visual modalities for speech understanding comes from the "complementarity" [21] of the two modalities and from their "synergy": the performance of audio-visual speech perception can outperform that of acoustic-only and visual-only perception under diverse noise conditions [22]. Generally, in AVSR systems, integration can take place either before the two information sources are processed by a recognizer (early integration/feature fusion) or after they are classified independently (late integration/decision fusion). Some studies favor early integration [1, 6, 7, 13], while others prefer late integration [19, 24, 25]. Despite all these studies, which underline the fact that speech reading is part of speech recognition in humans, it is still not well understood when and how acoustic and visual information are integrated. This paper takes advantage of the practical implementation benefits of late integration to construct a robust AVSR system.
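To make the decision-fusion idea concrete, the sketch below shows one common form of reliability-ratio weighting: each modality's reliability is taken as the average gap between its best log-likelihood and the remaining N-best hypothesis scores, and the integration weight is the audio reliability's share of the total. This is a minimal illustration under those assumptions; the function names, the dispersion-style reliability measure, and the default N are not taken from the paper, which instead selects the number of hypotheses and refines the weight with a genetic algorithm.

```python
import numpy as np

def reliability(log_likelihoods, n_best):
    """Dispersion-style reliability: mean gap between the top
    hypothesis score and the next (n_best - 1) scores.
    Requires n_best >= 2 so at least one gap exists."""
    ll = np.sort(np.asarray(log_likelihoods, dtype=float))[::-1][:n_best]
    return float(np.mean(ll[0] - ll[1:]))

def fuse(audio_ll, visual_ll, n_best=5):
    """Combine per-word log-likelihoods of the two single-modality
    recognizers with a reliability-ratio integration weight:
        fused = lam * audio + (1 - lam) * visual,
    where lam = S_audio / (S_audio + S_visual)."""
    audio_ll = np.asarray(audio_ll, dtype=float)
    visual_ll = np.asarray(visual_ll, dtype=float)
    s_a = reliability(audio_ll, n_best)
    s_v = reliability(visual_ll, n_best)
    lam = s_a / (s_a + s_v)
    return lam * audio_ll + (1.0 - lam) * visual_ll

# Recognize by taking the word with the highest fused score:
# word_index = int(np.argmax(fuse(audio_scores, visual_scores)))
```

With this measure, a sharply peaked (hence confident) recognizer gets a larger weight, which matches the intuition that at high SNR the acoustic stream should dominate and at low SNR the visual stream should take over.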