International Journal of Advanced Computer Research, Vol 10(47)
ISSN (Print): 2249-7277 ISSN (Online): 2277-7970
http://dx.doi.org/10.19101/IJACR.2019.940134
Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Eslam E. El Maghraby1*, Amr M. Gody2 and Mohamed Hesham Farouk3

1Assistant Lecturer, Department of Computers and Information, Fayoum University, Egypt
2Faculty, Department of Electrical Engineering, Cairo University, Egypt
3Professor, Department of Math & Physics, Cairo University, Egypt
Received: 06-November-2019; Revised: 15-January-2020; Accepted: 10-March-2020
©2020 Eslam E. El Maghraby et al. This is an open access article distributed under the Creative Commons Attribution (CC BY)
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Research Article

Abstract
Multimodal speech recognition has proved to be one of the most promising approaches to designing a robust speech recognition system, especially when the audio signal is corrupted by noise. The visual signal can provide additional information to enhance recognition accuracy in a noisy environment, since the reliability of the visual signal is not affected by acoustic noise. The critical stage in designing a robust speech recognition system is the choice of an appropriate feature extraction method for both the audio and visual signals, and the choice of a reliable classification method from the large variety of existing classification techniques. This paper proposes an audio-visual speech recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features by comparing different feature extraction methods, such as the discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients combined with local binary patterns (HOG+LBP), and applying different dimensionality reduction techniques, such as principal component analysis (PCA), auto-encoders, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), to find the most effective feature vector size. These features are then early-integrated with audio features obtained from Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution of this research is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, such as bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNN), with traditional hidden Markov models (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different SNRs. The model performs speaker-independent experiments on the AVletters dataset and speaker-dependent experiments on the GRID dataset. The experimental results show that early integration of audio features obtained by MFCC and visual features obtained by DCT yields higher recognition accuracy when used with the BiLSTM classifier than the other feature extraction and classification techniques. For GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47%, with enhancements of up to 9.28% and 12.05% over audio-only for clean and noisy data, respectively. For AVletters, the highest recognition accuracy is 93.33%, an enhancement of up to 8.33% over audio-only. The obtained results show a performance enhancement over previously reported audio-visual recognition accuracies on GRID and AVletters and prove the robustness of our BiLSTM AV-ASR model compared with CNN and HMM, because BiLSTM takes the sequential characteristics of the speech signal into account.

Keywords
AV-ASR, DCT, blocked DCT, PCA, MFCC, HMM, BiLSTM, CNN, AVletters, GRID.

*Author for correspondence

1. Introduction
Human speech understanding relies on both audio and visual information, e.g., the movements of the speaker's lips and tongue. Speech is a multimodal signal that depends on audio and visual modalities; therefore, to build a high-quality, noise-robust speech recognition system, it is important to exploit the different modalities of the speech signal to enhance the speech understanding process. Using a visual modality such as lip movements to identify spoken words is called lipreading. Lipreading can be used in addition to the audio signal to enhance speech recognition.
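The early-integration (feature-level fusion) scheme described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the mouth-region frames and MFCC vectors are random stand-ins, and the ROI size, the number of retained DCT coefficients, and the PCA dimension are assumptions chosen only for illustration. Visual features are taken as the low-frequency coefficients of a 2-D DCT of each frame, reduced with PCA, and concatenated with per-frame audio features before classification.

```python
# Hypothetical sketch of early audio-visual integration: 2-D DCT visual
# features, PCA dimensionality reduction, then concatenation with MFCCs.
# All data here is random stand-in data; shapes are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

rng = np.random.default_rng(0)

def dct2(block):
    """2-D type-II DCT with orthonormal scaling (rows, then columns)."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def visual_features(frames, k=8):
    """Keep the top-left k x k low-frequency DCT coefficients per frame."""
    return np.stack([dct2(f)[:k, :k].ravel() for f in frames])

def pca_reduce(X, n_components=20):
    """PCA via SVD on mean-centred data; returns projected features."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# 75 video frames of a 32x32 mouth ROI (stand-in for real lip images)
frames = rng.standard_normal((75, 32, 32))
mfcc = rng.standard_normal((75, 13))          # stand-in 13-dim MFCC per frame

vis = pca_reduce(visual_features(frames), n_components=20)
fused = np.concatenate([mfcc, vis], axis=1)   # early (feature-level) fusion
print(fused.shape)                            # (75, 33)
```

The fused sequence would then be fed, frame by frame, to a sequence classifier such as the BiLSTM the paper compares against CNN and HMM baselines.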