Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
Vol. 10, No. 2, June 2022, pp. 385~397
ISSN: 2089-3272, DOI: 10.52549/ijeei.v10i2.3683
Journal homepage: http://section.iaesonline.com/index.php/IJEEI/index
Multimodal Based Audio-Visual Speech Recognition for
Hard-of-Hearing: State of the Art Techniques and Challenges
Shabina Bhaskar¹, Thasleema Thazhath Madathil²
¹,²Department of Computer Science, Central University of Kerala, India, 671320
Article Info

Article history:
Received Jan 25, 2022
Revised Apr 4, 2022
Accepted May 11, 2022

ABSTRACT
Multimodal Integration (MI) is the study of merging the knowledge acquired
by the nervous system through sensory modalities such as speech, vision,
touch, and gesture. The applications of MI span Audio-Visual Speech
Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition
(ER), Biometric Applications (BMA), Affect Recognition (AR), Multimedia
Retrieval (MR), etc. Fusions of modalities such as hand gesture with facial
expression and lip movement with hand position are mainly used as sensory
inputs to develop multimodal systems for the hard of hearing. This paper
presents an overview of the multimodal systems available in the literature
for hard-of-hearing studies. It also discusses some of the studies related
to hard-of-hearing acoustic analysis. It is observed that far fewer
algorithms have been developed for hard-of-hearing AVSR than for
normal-hearing AVSR. Thus, audio-visual speech recognition systems for the
hard of hearing are in high demand among people trying to communicate in
their native spoken languages. This paper also highlights the
state-of-the-art techniques in AVSR and the challenges faced by researchers
in developing AVSR systems.
Keywords:
Automatic speech recognition,
Sign language,
Audio-visual systems,
Speech recognition,
Neural networks.
Copyright © 2022 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Shabina Bhaskar,
Department of Computer Science,
Central University of Kerala,
Kasaragod, India, 671320.
Email: shabinabhaskar@cukerala.ac.in
1. INTRODUCTION
Multimodal Integration (MI) offers more natural, accessible, and transparent ways of combining
information from different sources. MI systems emerged approximately 30 years ago. The earliest MI
experiments were carried out in Audio-Visual Speech Recognition (AVSR) to improve the robustness of speech
recognition in noisy acoustic environments. AVSR has been an active research area for many years, and
its transition from classical machine learning to deep learning improved the recognition rates of AVSR
systems. Even though many applications have been developed to help people with hearing loss recognize the
speech of normal-hearing speakers, only very few works have been reported on analyzing speech uttered by
hard-of-hearing persons [1]. The development of language vocabulary in a person depends on the ability to
hear. For a normal-hearing person, language vocabulary develops in infancy. The situation changes when a
person experiences hearing loss, which delays the development of language and communication skills and
can negatively affect language and vocabulary development. According to the Hearing Loss Association
of America, the degree of hearing loss can be classified as follows (a short illustrative sketch is
given after the list):
• Slight hearing loss (16-25 dB)
• Mild hearing loss (26-40 dB)
• Moderate hearing loss (41-55 dB)
• Moderately severe hearing loss (56-70 dB)
• Severe hearing loss (71-90 dB)
• Profound hearing loss (above 90 dB).
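To make these ranges concrete, the following is a minimal Python sketch (ours, not from the paper) that
maps an average hearing threshold in dB HL to the severity categories listed above. Treating thresholds
at or below 15 dB as normal hearing is an assumption we add, following common audiological practice; all
other boundaries come directly from the list.

# Illustrative sketch: map an average hearing threshold (dB HL) to the
# Hearing Loss Association of America severity category quoted above.
# The "normal hearing" cutoff (<= 15 dB) is our added assumption.
def hearing_loss_category(threshold_db: float) -> str:
    if threshold_db <= 15:
        return "normal hearing"
    elif threshold_db <= 25:
        return "slight hearing loss"
    elif threshold_db <= 40:
        return "mild hearing loss"
    elif threshold_db <= 55:
        return "moderate hearing loss"
    elif threshold_db <= 70:
        return "moderately severe hearing loss"
    elif threshold_db <= 90:
        return "severe hearing loss"
    else:
        return "profound hearing loss"

if __name__ == "__main__":
    # Example: a 60 dB HL threshold falls in the moderately severe band.
    for db in (10, 30, 60, 95):
        print(f"{db} dB HL -> {hearing_loss_category(db)}")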