National Conference ETEIC-2012 Proceedings, April 6th-7th, 2012, Anand Engineering College, Agra (ETEIC 2012) ISBN 978-93-81583-35-7

Bimodal Speech Recognition: A Review

Priyanka Varshney1, Prashant Upadhyaya2, Omar Farooq3
1 priyankavarshney88@gmail.com, 2 upadhyaya.prashant@rediffmail.com
3 Department of Electronics Engineering, Aligarh Muslim University, Aligarh-202002, India.
3 omarfarooq70@gmail.com

ABSTRACT

Speech recognition by machine is a crucial ingredient of many important human-machine interface applications. Combining audio and visual information promises higher recognition accuracy and robustness than audio information alone. This paper gives an overview of the different approaches used for speech recognition. It helps in choosing a technique for audio-visual speech recognition by presenting the relative merits and demerits of each approach. The paper concludes with the decision to develop a technique that increases the accuracy of audio-visual recognition systems in noisy background conditions.

Index Terms: Speech Recognition, Human Computer Interface, Discrete Cosine Transform (DCT), Mel Frequency Cepstral Coefficient (MFCC), Hidden Markov Model (HMM).

I. INTRODUCTION

Due to advances in computing power, speech recognition has found applications in many consumer products, such as mobile phones, computers, voice dialling, and even as a password [1]. Since some of these applications may not operate in noise-free background conditions [2], robustness is an important issue for practical use. Human speech perception is bimodal in nature: humans combine audio and visual information, the latter being used especially in noisy environments [3]. Visual information is also beneficial when the listener suffers from impaired hearing or when the acoustic signal is degraded [4]. An Automatic Speech Recognition (ASR) system typically consists of a microphone unit, a computer, and speech recognition software.
The basic units of ASR are shown in Figure 1.

Hena Raihan et al. [5] integrated 2D-DCT based visual features with audio MFCC features to recognize Hindi fricatives. Linear and quadratic discriminant function based classifiers were used for the identification of three fricatives. The proposed visual features, when integrated with the MFCC based features, showed improvement over audio

[Fig. 9: General procedure for audio visual speech recognition. Blocks shown: Audio, Video, Pre-processing, Feature extraction (audio), Face detection, ROI estimation, Lip extraction, Feature extraction (visual), Audio-visual feature integration, Classifier, Audio recognition, Visual recognition, Audio-visual recognition.]
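As a rough illustration of feature-level integration of 2D-DCT visual features with audio features, the sketch below computes an orthonormal 2D-DCT of a lip region of interest (ROI), keeps a small block of low-frequency coefficients, and concatenates them with an audio feature vector. This is a minimal sketch under stated assumptions, not the method of [5]: the function names, the `keep=4` coefficient block, the ROI size, and the random arrays standing in for a lip image and a 13-dimensional MFCC vector are all hypothetical, and the source does not specify how coefficients were selected or fused.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)   # frequency index (rows)
    i = np.arange(n).reshape(1, -1)   # sample index (columns)
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0, :] *= np.sqrt(1.0 / n)   # scale the DC row
    basis[1:, :] *= np.sqrt(2.0 / n)  # scale the remaining rows
    return basis

def dct2(block):
    """Separable 2D DCT-II: transform the rows, then the columns."""
    rows, cols = block.shape
    return dct_matrix(rows) @ block @ dct_matrix(cols).T

def visual_features(lip_roi, keep=4):
    """Keep the top-left keep x keep low-frequency coefficients (assumed choice)."""
    coeffs = dct2(np.asarray(lip_roi, dtype=float))
    return coeffs[:keep, :keep].ravel()

def fuse_features(audio_vec, visual_vec):
    """Feature-level integration by simple concatenation (assumed scheme)."""
    return np.concatenate([audio_vec, visual_vec])

# Hypothetical stand-ins for one video frame and one audio frame.
rng = np.random.default_rng(0)
lip_roi = rng.random((16, 24))   # placeholder grayscale lip ROI
audio_vec = rng.random(13)       # placeholder for 13 MFCC coefficients
fused = fuse_features(audio_vec, visual_features(lip_roi))
print(fused.shape)
```

The fused vector would then be passed to a classifier such as a linear or quadratic discriminant function. Where SciPy is available, `scipy.fft.dctn(block, norm="ortho")` should give the same transform as `dct2` here.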