Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
Vol. 10, No. 2, June 2022, pp. 385~397
ISSN: 2089-3272, DOI: 10.52549/ijeei.v10i2.3683
Journal homepage: http://section.iaesonline.com/index.php/IJEEI/index
Multimodal Based Audio-Visual Speech Recognition for
Hard-of-Hearing: State of the Art Techniques and Challenges
Shabina Bhaskar¹, Thasleema Thazhath Madathil²
¹,²Department of Computer Science, Central University of Kerala, India, 671320
Article Info

Article history:
Received Jan 25, 2022
Revised Apr 4, 2022
Accepted May 11, 2022

ABSTRACT
Multimodal Integration (MI) is the study of merging the knowledge acquired
by the nervous system through sensory modalities such as speech, vision,
touch, and gesture. The applications of MI span Audio-Visual Speech
Recognition (AVSR), Sign Language Recognition (SLR), Emotion Recognition
(ER), Biometric Applications (BMA), Affect Recognition (AR), Multimedia
Retrieval (MR), etc. Fusions of modalities such as hand gesture with facial
expression and lip movement with hand position are mainly used as sensory
inputs to develop multimodal systems for the hard of hearing. This paper
presents an overview of the multimodal systems available in the literature
for hard-of-hearing studies. It also discusses some of the studies related
to hard-of-hearing acoustic analysis. It is observed that far fewer
algorithms have been developed for hard-of-hearing AVSR than for
normal-hearing AVSR. Thus, audio-visual speech recognition systems for the
hard of hearing are in high demand among people trying to communicate in
their native spoken languages. This paper also highlights the
state-of-the-art techniques in AVSR and the challenges faced by researchers
in developing AVSR systems.
Keywords:
Automatic speech recognition,
Sign language,
Audio-visual systems,
Speech recognition,
Neural networks.
Copyright © 2022 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Shabina Bhaskar,
Department of Computer Science,
Central University of Kerala,
Kasaragod, India, 671320.
Email: shabinabhaskar@cukerala.ac.in
1. INTRODUCTION
Multimodal Integration (MI) offers more natural, accessible, and transparent ways of combining
information from different sources. MI systems emerged approximately 30 years ago. The earliest MI
experiments were carried out in Audio-Visual Speech Recognition (AVSR) to improve the robustness of speech
recognition in noisy acoustic environments. AVSR has been an active research area for many years, and
its transition from classical machine learning to deep learning improved the recognition rates of AVSR
systems. Even though many applications have been developed to help people with hearing loss recognize the
speech of normal-hearing speakers, only very few works have been reported on analyzing speech uttered by
hard-of-hearing persons [1]. The development of language vocabulary in a person depends on the ability to
hear. For a normal-hearing person, language vocabulary develops in infancy. The situation changes when a
person experiences hearing loss, which delays the development of language and communication skills and
can negatively affect language and vocabulary development. According to the Hearing Loss Association
of America, the degree of hearing loss can be classified as follows (a short illustrative sketch is
given after the list):
• Slight hearing loss (16-25 dB)
• Mild hearing loss (26-40 dB)
• Moderate hearing loss (41-55 dB)
• Moderately severe hearing loss (56-70 dB)
• Severe hearing loss (71-90 dB)
• Profound hearing loss (above 90 dB).
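To make these ranges concrete, the following is a minimal Python sketch (ours, not from the paper) that
maps an average hearing threshold in dB HL to the severity categories listed above. Treating thresholds
at or below 15 dB as normal hearing is an assumption we add, following common audiological practice; all
other boundaries come directly from the list.

# Illustrative sketch: map an average hearing threshold (dB HL) to the
# Hearing Loss Association of America severity category quoted above.
# The "normal hearing" cutoff (<= 15 dB) is our added assumption.
def hearing_loss_category(threshold_db: float) -> str:
    if threshold_db <= 15:
        return "normal hearing"
    elif threshold_db <= 25:
        return "slight hearing loss"
    elif threshold_db <= 40:
        return "mild hearing loss"
    elif threshold_db <= 55:
        return "moderate hearing loss"
    elif threshold_db <= 70:
        return "moderately severe hearing loss"
    elif threshold_db <= 90:
        return "severe hearing loss"
    else:
        return "profound hearing loss"

if __name__ == "__main__":
    # Example: a 60 dB HL threshold falls in the moderately severe band.
    for db in (10, 30, 60, 95):
        print(f"{db} dB HL -> {hearing_loss_category(db)}")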