MORPHEME-BASED AUTOMATIC SPEECH RECOGNITION FOR A MORPHOLOGICALLY RICH LANGUAGE – AMHARIC

Martha Yifiru Tachbelie, Solomon Teferra Abate, Wolfgang Menzel

Department of Informatics, University of Hamburg
Vogt-Kölln Str. 30, D-22527 Hamburg, Germany
{abate,tachbeli,menzel}@informatik.uni-hamburg.de

ABSTRACT

Out-of-vocabulary (OOV) words are a major source of error in speech recognition systems, and various methods have been proposed to improve system performance by dealing with them properly. This paper presents an automatic speech recognition experiment conducted to examine the effect of OOV words on the performance of a speech recognition system for Amharic, a morphologically rich language. We tried to solve the OOV problem by using morphemes as dictionary and language model units. We found that, for a small-vocabulary (5k) system, morphemes are better lexical and language modeling units than words: using a morph-based vocabulary yielded an absolute improvement of 11.57% in word recognition accuracy. For large vocabularies, however, morpheme-based systems did not bring much performance improvement, because they suffer from acoustic confusability and limited language model scope, while word-based recognizers benefit considerably from the reduction in OOV rate.

Index Terms: Out-of-Vocabulary problem, Morpheme-based speech recognition, Amharic

1. INTRODUCTION

Most large vocabulary speech recognition systems operate with a finite vocabulary. All words that are not in the system's vocabulary are considered out-of-vocabulary words. These words are one of the major sources of error in an automatic speech recognition system. When a speech recognition system is confronted with a word that is not in its vocabulary, it may recognize it as a phonetically similar in-vocabulary item. That is, the OOV word is misrecognized, which in turn may cause its neighboring words to be misrecognized as well.
As indicated in [1], each OOV word in the test data contributes to 1.6 errors on average. Different approaches have therefore been investigated to cope with the OOV problem and consequently to reduce the error rate of automatic speech recognition systems. One of these approaches is vocabulary optimization [2], where the vocabulary is selected in a way that reduces the OOV rate. This involves either increasing the vocabulary size or including frequent words in the vocabulary. This approach may work for morphologically simple languages like English, where a 20k vocabulary has a 2% OOV rate and a 65k vocabulary only 0.6% [3]. However, for morphologically rich languages, for which OOV is a severe problem, a much larger vocabulary is required to reach a 1% OOV rate. According to [3], 800k and 400k vocabularies are required for Russian and Arabic, respectively, to reach a 1% OOV rate. Increasing the vocabulary is not the best way to alleviate the OOV problem, especially for morphologically rich languages, because system complexity increases with the size of the vocabulary. Therefore, modeling sub-word units, particularly morphs, has been used for morphologically rich languages. Many researchers [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] have conducted morpheme-based or sub-word-based speech recognition experiments.

In this paper, we show the effect of the OOV rate on the performance of an Amharic speech recognition system. We investigate options to reduce the OOV problem using morphemes as lexical and language modeling units and study their effect on the performance of the system. Section 2 gives a brief description of Amharic word morphology. After reviewing previous work on morpheme-based speech recognition for Amharic in Section 3, we present the results of our experiments in Sections 4, 5 and 6. Finally, conclusions are drawn and recommendations for future work are derived in Section 7.

2. AMHARIC MORPHOLOGY

Amharic is a member of the Ethio-Semitic languages, which belong to the Semitic branch of the Afro-Asiatic super family [15]. It is related to Hebrew, Arabic, and Syriac. Amharic is a major language spoken mainly in Ethiopia. According to the 1998 census, it is spoken by over 17 million people as a first language and by over 5 million as a second language throughout different regions of Ethiopia [16].

Like other Semitic languages such as Arabic, Amharic exhibits a root-pattern morphological phenomenon. A root is a