Int J Speech Technol (2006) 9: 133–150
DOI 10.1007/s10772-008-9009-1
Arabic speech recognition using SPHINX engine
Hussein Hyassat · Raed Abu Zitar
Received: 1 October 2008 / Accepted: 9 October 2008 / Published online: 28 October 2008
© Springer Science+Business Media, LLC 2008
Abstract Although the Arab world has an estimated
250 million Arabic speakers, there has been little
research on Arabic speech recognition compared with
other languages of similar importance (e.g. Mandarin).
Because of the lack of diacritized Arabic text and of a
Pronunciation Dictionary (PD), most previous work on
Arabic Automatic Speech Recognition has concentrated
on developing recognizers that use Romanized characters,
i.e. the system recognizes the Arabic word as if it were
an English one and then maps it to an Arabic word
through a lookup table that maps each Arabic word to
its Romanized pronunciation.
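The Romanized lookup approach described above can be illustrated with a minimal sketch. The table entries, the function name `map_to_arabic`, and the `<UNK>` placeholder are assumptions made for illustration; they are not taken from any of the earlier systems.

```python
# Hypothetical lookup table: Romanized form -> Arabic word.
# The entries are illustrative only and are not taken from the paper.
ROMANIZED_TO_ARABIC = {
    "kitab": "كتاب",
    "salam": "سلام",
    "wahid": "واحد",
}


def map_to_arabic(romanized_tokens, table=ROMANIZED_TO_ARABIC, unknown="<UNK>"):
    """Map each Romanized token emitted by the recognizer to an Arabic word.

    Tokens that are missing from the table are replaced by `unknown`.
    """
    return [table.get(token.lower(), unknown) for token in romanized_tokens]


# Example: tokens as a Romanized recognizer might output them.
decoded = ["Salam", "kitab", "xyz"]
arabic = map_to_arabic(decoded)
```

The sketch also shows the limitation of this design: any word absent from the table cannot be recovered, which is one reason a recognizer working directly on Arabic is attractive.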
In this work, we introduce the first SPHINX-IV-based
Arabic recognizer and propose an automatic toolkit
capable of producing a PD for both the Holy Qur’an
and standard Arabic. Three corpora are developed
entirely in this work: the Holy Qur’an Corpus (HQC-1),
about 18.5 hours; the command and control corpus
(CAC-1), about 1.5 hours; and the Arabic digits corpus
(ADC), less than one hour of speech. The building process is
H. Hyassat
Arab Academy of Business and Financial Sciences,
Amman, Jordan
R. Abu Zitar (✉)
School of Computing and Engineering, New York Institute
of Technology, Amman, Jordan
e-mail: rzitar@nyit.edu
completely described. Fully diacritized Arabic
transcriptions for all three corpora were also developed.
The SPHINX-IV engine was customized and trained
for both the language model and the lexicon modules
shown in the framework architecture block diagram
on the next page.
Using the three corpora, the PD produced by our
automatic tool, and the transcripts, the SPHINX-IV
engine is trained and tuned to develop three acoustic
models, one for each corpus. Training is based on an
HMM model built on statistical information and
random-variable distributions extracted from the
training data itself. A new algorithm is proposed to add
unlabeled data to the training corpus in order to
increase the corpus size. The algorithm is based on a
neural network confidence scorer, which is used to
annotate the decoded speech in order to decide whether
the proposed transcript is accepted and can be added to
the seed corpus.
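The acceptance decision described above can be sketched as follows. This is only an illustration under stated assumptions: the `Hypothesis` fields, the two decoder features, the weights, and the 0.8 threshold are hypothetical, and a single logistic unit stands in for the trained neural network confidence scorer.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Hypothesis:
    utterance_id: str
    transcript: str            # transcript proposed by the decoder
    features: Sequence[float]  # hypothetical per-utterance decoder features


def logistic_scorer(features: Sequence[float],
                    weights: Sequence[float] = (2.0, 1.0),
                    bias: float = -1.5) -> float:
    """Stand-in for a trained neural network: one logistic unit in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def select_for_seed(hypotheses: List[Hypothesis],
                    scorer: Callable[[Sequence[float]], float] = logistic_scorer,
                    threshold: float = 0.8) -> Tuple[List[Hypothesis], List[Hypothesis]]:
    """Split decoded utterances into accepted and rejected transcripts.

    A transcript is accepted (and could be added to the seed corpus)
    when its confidence score reaches the threshold.
    """
    accepted, rejected = [], []
    for hyp in hypotheses:
        if scorer(hyp.features) >= threshold:
            accepted.append(hyp)
        else:
            rejected.append(hyp)
    return accepted, rejected


# Example with two made-up decoded utterances.
decoded = [
    Hypothesis("utt-001", "transcript one", (1.0, 1.0)),
    Hypothesis("utt-002", "transcript two", (0.2, 0.1)),
]
accepted, rejected = select_for_seed(decoded)
```

Passing the scorer as a callable keeps the selection loop independent of how confidence is estimated, so a trained network could replace `logistic_scorer` without changing `select_for_seed`.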
The model parameters were fine-tuned using a simulated
annealing algorithm; optimum values were tested and
reported. Our major contribution is the use of the
open-source SPHINX-IV model for Arabic speech
recognition, building our own language and acoustic
models without Romanization of the Arabic speech.
The system is fine-tuned and the data are refined for
training and validation. Optimum values for the number
of Gaussian mixture distributions and the number of
states in the HMMs have been found according to specified
performance measures. Optimum values for confidence