Toward a High Performance Recognizer for Classical Arabic Sounds

Mohamed O.M. Khelifa (Telecommunications and Embedded Systems Team, ENSIAS, Mohammed V University, Rabat, Morocco; khlivamohamed@yahoo.com)
Yahya O.M. Elhadj (on sabbatical leave at the SAMoVA Research Team, IRIT, Paul Sabatier University, Toulouse, France; Yahya.Elhadj@irit.fr)
Yousfi Abdellah (Faculty of Juridical, Economic and Social Sciences, Mohammed V University, Rabat, Morocco; yousfi240ma@yahoo.fr)
Mostafa Belkasmi (Telecommunications and Embedded Systems Team, ENSIAS, Mohammed V University, Rabat, Morocco; m.belkasmi@um5s.net.ma)

Abstract— This paper is part of ongoing work aiming to build an accurate recognizer for Classical Arabic sounds, usable for teaching and learning purposes. Previous efforts focused first on the development of a dedicated sound database covering Classical Arabic sounds, built from recitations of the Holy Quran; the speech signals of this database were manually segmented and labeled at three levels: word, phoneme, and allophone. Next, two baseline recognizers were built to validate the speech segmentation at both the phoneme and allophone levels, and also to test the feasibility of the intended sound recognition target. The current phase, which is a PhD work, considers the development of an elaborated recognizer by starting from the basic sounds and searching for their distinctive features (e.g., duration, energy) to determine which ones are particularly helpful in identifying the phonological variations of each basic sound. Here, we present the first results of basic sound recognition obtained so far.

Keywords—Speech recognition; sound databases; hidden Markov models; speech segmentation; pronunciation error detection.

I. INTRODUCTION

Automatic Speech Recognition (ASR) technology allows a machine to identify the textual content of spoken speech; depending on the type of application, this textual content may be further processed to suit a specific task [1-4]. Early ASR applications were limited to relatively simple tasks, such as speaker-dependent isolated-keyword recognition. Nowadays, a myriad of ASR applications have appeared, covering a wide range of tasks: remote control over the phone, assistance for the disabled and persons with special needs, speaker identification, language identification, archiving, search and retrieval, language acquisition, and so on.

Despite the wide use of speech recognition technology in other languages, Arabic still suffers from a scarcity of mature ASR applications, especially for language learning and evaluation. One distinguished application of Arabic speech recognition is teaching the Classical Arabic sound system. Although Classical Arabic is not used in daily communication, it is required for learning the Holy Quran and the old poetry heritage. Moreover, it can open the door to several kinds of Islamic applications.

This work is a continuation of previous efforts aimed at developing a high performance recognizer for Classical Arabic sounds to be used for teaching and learning purposes [5]. The first stages of these efforts were devoted to the preparation of an appropriate sound database to support the ultimate goal [6-9]. Thus, ten recitations of a well-chosen part of the Holy Quran were recorded and manually segmented and annotated at three levels: word, phoneme, and allophone. To validate this sound database and to test the feasibility of the goal, two baseline recognizers, for phonemes and for allophones, were developed [10-11].
The current phase aims to develop an accurate recognizer by first considering the basic sounds and then exploring different features of each basic sound separately to determine the most pertinent ones for identifying its phonological variations. By basic sounds we mean the basic phonemes without any phonological variation, and even without considering phoneme gemination (doubling). In this paper, we present the results of basic sound recognition obtained so far. We follow the same methodology employed in the previous baseline recognizers so as to make appropriate comparisons and give pertinent suggestions and recommendations for our future steps. Thus, the Hidden Markov Model Toolkit (HTK, http://htk.eng.cam.ac.uk) is used as the development environment of the recognizer, as in the previous works; each basic sound is modeled by a 3-emitting-state HMM with a mixture of Gaussians [4, 12].

The rest of the paper is organized as follows: Section 2 gives a quick overview of the previously developed sound database (which we call the "CA Sound Database"); Section 3 presents our adaptation of the "CA Sound Database" so that it is annotated in terms of basic sounds; Section 4 is dedicated to the development of the recognizer; Section 5 discusses the results and future improvements; the last section concludes the paper and highlights perspectives.
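For readers unfamiliar with HTK, the fragment below sketches what a 3-emitting-state prototype HMM looks like in HTK's model definition language. It is only an illustration: for readability it uses a 4-dimensional user-defined feature vector (`<USER>`) and single-Gaussian output densities, whereas the recognizer described here uses Gaussian mixtures; the transition probabilities shown are arbitrary initial values.

```
~o <VecSize> 4 <USER>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 4
      0.0 0.0 0.0 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 4
      0.0 0.0 0.0 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 4
      0.0 0.0 0.0 0.0
    <Variance> 4
      1.0 1.0 1.0 1.0
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>
```

States 1 and 5 are the non-emitting entry and exit states, so `<NumStates> 5` yields the three emitting states mentioned above; in a typical HTK workflow, single Gaussians are grown into mixtures after initial training (e.g., with HHEd's MU command).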
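To make the notion of segment-level distinctive features concrete, the short sketch below computes two of the candidate features mentioned above (duration and energy) for one labeled segment. This is an illustration only: the function name and the raw sum-of-squares definition of energy are our own assumptions, not the actual front end of the recognizer.

```python
# Illustrative sketch only: computes two candidate distinctive features
# (duration and energy) for one labeled speech segment. The function name
# and the raw sum-of-squares energy definition are assumptions.

def segment_features(samples, sample_rate):
    """Return (duration in seconds, energy) of one speech segment.

    samples: sequence of amplitude values for the segment
    sample_rate: sampling frequency in Hz
    """
    duration = len(samples) / sample_rate  # segment length in seconds
    energy = sum(s * s for s in samples)   # raw short-time energy
    return duration, energy


if __name__ == "__main__":
    # A 160-sample segment at 16 kHz lasts 10 ms.
    dur, eng = segment_features([0.5] * 160, 16000)
    print(dur, eng)  # 0.01 40.0
```

In practice, such per-segment measurements would be computed from the manually segmented boundaries of the CA Sound Database and then compared across the phonological variants of each basic sound.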