Mohamed Elmahdy, Mark Hasegawa-Johnson, & Eiman Mustafawi
International Journal of Computational Linguistics (IJCL), Volume (3): Issue (1): 2012

Hybrid Phonemic and Graphemic Modeling for Arabic Speech Recognition

Mohamed Elmahdy (mohamed.elmahdy@qu.edu.qa), Qatar University, Qatar
Mark Hasegawa-Johnson (jhasegaw@illinois.edu), University of Illinois, USA
Eiman Mustafawi (eimanmust@qu.edu.qa), Qatar University, Qatar

Abstract

In this research, we propose a hybrid approach to acoustic and pronunciation modeling for Arabic speech recognition. The hybrid approach benefits from both vocalized and non-vocalized Arabic resources, exploiting the fact that non-vocalized resources are always more abundant than vocalized ones. Two baseline speech recognition systems were built: phonemic and graphemic. The two baseline acoustic models were trained independently and then fused to create a hybrid acoustic model. Pronunciation modeling was likewise hybrid, generating both graphemic and phonemic pronunciation variants. Different pronunciation modeling techniques are proposed to reduce model complexity. Experiments were conducted in the large-vocabulary news broadcast speech domain. The proposed hybrid approach achieved a relative WER reduction of 8.8% to 12.6%, depending on the pronunciation modeling settings and the degree of supervision in the baseline systems.

Keywords: Arabic, Acoustic modeling, Pronunciation modeling, Speech recognition.

1. INTRODUCTION

Arabic is a morphologically very rich language, inflected for gender, definiteness, tense, number, case, humanness, etc. Due to this morphological complexity, a simple lookup table for phonetic transcription (essential for acoustic and pronunciation modeling) is not appropriate, because of the high out-of-vocabulary (OOV) rate. For instance, in Arabic, a lexicon of 65K words in the news broadcast domain leads to an OOV rate on the order of 5%, whereas in English it leads to an OOV rate of less than 1%.
Furthermore, Arabic is usually written without diacritic marks. Text resources without diacritics are known as non-vocalized (or non-diacritized). These diacritics are essential for recovering short vowels, nunation, gemination, and silent letters. The absence of diacritic marks leads to a high degree of ambiguity in pronunciation and meaning [10, 13]. In order to train a phoneme-based acoustic model for Arabic, the training speech corpus must be provided with fully vocalized transcriptions. The mapping from vocalized text to phonetic transcription is then almost one-to-one [10]. State-of-the-art Arabic vocalization is usually performed in several phases. In one phase, orthographic transcriptions are manually written without diacritics. Afterwards, statistical techniques are applied to restore the missing diacritic marks. This process is known as “automatic diacritization”. Automatic diacritization techniques can result in a diacritization WER of 15%-25%, as reported in [10, 12, 16].
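The near one-to-one character of the vocalized-text-to-phoneme mapping can be illustrated with a minimal sketch. The symbol inventory below is a hypothetical toy subset chosen for illustration, not the actual phone set or lexicon used in the paper's systems, and it ignores complications such as gemination (shadda) and silent letters:

```python
# Toy grapheme-to-phoneme table for fully vocalized Arabic.
# Each consonant letter or diacritic maps to (at most) one phoneme,
# which is what makes the vocalized mapping nearly one-to-one.
# The inventory is an illustrative assumption, not the paper's.
G2P = {
    "\u0643": "k",   # kaf
    "\u062A": "t",   # ta
    "\u0628": "b",   # ba
    "\u064E": "a",   # fatha (short vowel /a/)
    "\u064F": "u",   # damma (short vowel /u/)
    "\u0650": "i",   # kasra (short vowel /i/)
    "\u0652": "",    # sukun (absence of a vowel)
}

def vocalized_to_phonemes(word: str) -> list:
    """Map each character of a fully vocalized word to a phoneme."""
    phones = []
    for ch in word:
        if ch not in G2P:
            raise ValueError("unmapped character: %r" % ch)
        p = G2P[ch]
        if p:                 # sukun contributes no phoneme
            phones.append(p)
    return phones

# kataba ("he wrote"), written with explicit fatha diacritics:
print(vocalized_to_phonemes("\u0643\u064E\u062A\u064E\u0628\u064E"))
# → ['k', 'a', 't', 'a', 'b', 'a']
```

Without the diacritics, the same consonant skeleton (ktb) would be compatible with several different vocalizations and pronunciations, which is exactly the ambiguity that automatic diacritization tries to resolve.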