International Journal of Advanced Computer Research, Vol 10(47)
ISSN (Print): 2249-7277 ISSN (Online): 2277-7970
http://dx.doi.org/10.19101/IJACR.2019.940134
Noise robust speech recognition system using multimodal audio-visual approach using different deep learning classification techniques

Eslam E. El Maghraby1*, Amr M. Gody2 and Mohamed Hesham Farouk3

1Assistant Lecturer, Department of Computers and Information, Fayoum University, Egypt
2Faculty, Department of Electrical Engineering, Cairo University, Egypt
3Professor, Department of Math & Physics, Cairo University, Egypt
Received: 06-November-2019; Revised: 15-January-2020; Accepted: 10-March-2020
©2020 Eslam E. El Maghraby et al. This is an open access article distributed under the Creative Commons Attribution (CC BY)
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
Research Article

Abstract
Multimodal speech recognition has proved to be one of the most promising approaches to designing a robust speech recognition system, especially when the audio signal is corrupted by noise. The visual signal can provide additional information to enhance recognition accuracy in a noisy environment, since the reliability of the visual signal is not affected by acoustic noise. The critical stage in designing a robust speech recognition system is the choice of an appropriate feature extraction method for both the audio and visual signals, and the choice of a reliable classification method from the large variety of existing classification techniques. This paper proposes an audio-visual speech recognition (AV-ASR) system that uses both audio and visual speech modalities to improve recognition accuracy in clean and noisy environments. The contributions of this paper are twofold. The first is a methodology for choosing the visual features by comparing different feature extraction methods, such as the discrete cosine transform (DCT), blocked DCT, and histograms of oriented gradients combined with local binary patterns (HOG+LBP), and applying different dimensionality reduction techniques, such as principal component analysis (PCA), auto-encoders, linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE), to find the most effective feature vector size. These features are then early-integrated with audio features obtained from Mel-frequency cepstral coefficients (MFCCs) and fed into the classification process. The second contribution of this research is a methodology for developing the classification process using deep learning, comparing different deep neural network (DNN) architectures, such as bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNN), with traditional hidden Markov models (HMM). The effectiveness of the proposed model is demonstrated on two multi-speaker AV-ASR benchmark datasets, AVletters and GRID, at different SNRs. The model performs speaker-independent experiments on the AVletters dataset and speaker-dependent experiments on the GRID dataset. The experimental results show that early integration of audio features obtained by MFCC and visual features obtained by DCT yields higher recognition accuracy when used with the BiLSTM classifier than the other feature extraction and classification techniques. For GRID, the integrated audio-visual features achieved the highest recognition accuracies of 99.13% and 98.47%, with enhancements of up to 9.28% and 12.05% over audio-only for clean and noisy data, respectively. For AVletters, the highest recognition accuracy is 93.33%, an enhancement of up to 8.33% over audio-only. The obtained results show a performance enhancement over previously reported audio-visual recognition accuracies on GRID and AVletters and prove the robustness of our BiLSTM AV-ASR model compared with CNN and HMM, because BiLSTM takes the sequential characteristics of the speech signal into account.

Keywords
AV-ASR, DCT, blocked DCT, PCA, MFCC, HMM, BiLSTM, CNN, AVletters, GRID.

*Author for correspondence

1. Introduction
Human speech understanding relies on both audio and visual information, e.g., the movements of the speaker's lips and tongue. Speech is a multimodal signal that depends on audio and visual modalities; therefore, to build a high-quality, noise-robust speech recognition system, it is important to exploit the different modalities of the speech signal to enhance the speech understanding process. Using a visual modality such as lip movements to identify spoken words is called lipreading. Lipreading can be used in addition to the audio signal to enhance speech recognition.
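The early-integration (feature-level fusion) scheme described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the mouth-region frames and MFCC vectors are random stand-ins, and the ROI size, the number of retained DCT coefficients, and the PCA dimension are assumptions chosen only for illustration. Visual features are taken as the low-frequency coefficients of a 2-D DCT of each frame, reduced with PCA, and concatenated with per-frame audio features before classification.

```python
# Hypothetical sketch of early audio-visual integration: 2-D DCT visual
# features, PCA dimensionality reduction, then concatenation with MFCCs.
# All data here is random stand-in data; shapes are illustrative assumptions.
import numpy as np
from scipy.fftpack import dct

rng = np.random.default_rng(0)

def dct2(block):
    """2-D type-II DCT with orthonormal scaling (rows, then columns)."""
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def visual_features(frames, k=8):
    """Keep the top-left k x k low-frequency DCT coefficients per frame."""
    return np.stack([dct2(f)[:k, :k].ravel() for f in frames])

def pca_reduce(X, n_components=20):
    """PCA via SVD on mean-centred data; returns projected features."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# 75 video frames of a 32x32 mouth ROI (stand-in for real lip images)
frames = rng.standard_normal((75, 32, 32))
mfcc = rng.standard_normal((75, 13))          # stand-in 13-dim MFCC per frame

vis = pca_reduce(visual_features(frames), n_components=20)
fused = np.concatenate([mfcc, vis], axis=1)   # early (feature-level) fusion
print(fused.shape)                            # (75, 33)
```

The fused sequence would then be fed, frame by frame, to a sequence classifier such as the BiLSTM the paper compares against CNN and HMM baselines.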