Int J Speech Technol (2006) 9: 133–150
DOI 10.1007/s10772-008-9009-1

Arabic speech recognition using SPHINX engine

Hussein Hyassat · Raed Abu Zitar

Received: 1 October 2008 / Accepted: 9 October 2008 / Published online: 28 October 2008
© Springer Science+Business Media, LLC 2008

H. Hyassat
Arab Academy of Business and Financial Sciences, Amman, Jordan

R. Abu Zitar
School of Computing and Engineering, New York Institute of Technology, Amman, Jordan
e-mail: rzitar@nyit.edu

Abstract Although the Arab world has an estimated 250 million Arabic speakers, there has been little research on Arabic speech recognition compared to other languages of similar importance (e.g. Mandarin). Due to the lack of diacritized Arabic text and of a Pronunciation Dictionary (PD), most previous work on Arabic Automatic Speech Recognition has concentrated on developing recognizers using Romanized characters, i.e. the system recognizes the Arabic word as an English one and then maps it to the Arabic word using a lookup table that maps each Arabic word to its Romanized pronunciation. In this work, we introduce the first SPHINX-IV-based Arabic recognizer and propose an automatic toolkit capable of producing a PD for both the Holy Qura’an and the standard Arabic language. Three corpuses were developed in full in this work: the Holy Qura’an Corpus HQC-1 (about 18.5 hours), the command and control corpus CAC-1 (about 1.5 hours) and the Arabic digits corpus ADC (less than one hour of speech). The building process is described in full. Fully diacritized Arabic transcriptions for all three corpuses were developed as well. The SPHINX-IV engine was customized and trained for both the language model and the lexicon modules shown in the framework architecture block diagram on the next page.
Using the three corpuses, together with the PD developed by our automatic tool and the transcripts, the SPHINX-IV engine is trained and tuned to develop three acoustic models, one for each corpus. Training is based on an HMM built from statistical information and random-variable distributions extracted from the training data itself. A new algorithm is proposed to add unlabeled data to the training corpus in order to increase the corpus size. The algorithm is based on a neural network confidence scorer, which is used to annotate the decoded speech and so decide whether the proposed transcript is accepted and can be added to the seed corpus. The model parameters were fine-tuned using a simulated annealing algorithm; optimum values were tested and reported. Our main contribution is the use of the open-source SPHINX-IV model for Arabic speech recognition, building our own language and acoustic models without Romanization of the Arabic speech. The system is fine-tuned and the data are refined for training and validation. Optimum values for the number of Gaussian mixture distributions and the number of states in the HMMs have been found according to specified performance measures. Optimum values for confidence