The LIMSI RT-04 BN Arabic System

Abdel. Messaoudi,*† Lori Lamel and Jean-Luc Gauvain
Spoken Language Processing Group
LIMSI-CNRS, BP 133
91403 Orsay cedex, FRANCE
{abdel,gauvain,lamel}@limsi.fr

ABSTRACT

This paper describes the LIMSI Arabic Broadcast News system used in the RT-04F evaluation. The 10x system uses a 3-pass decoding strategy with MAP-adapted gender- and bandwidth-specific acoustic models, a vowelized 65k pronunciation lexicon, and a word-class 4-gram language model in which each word class groups all vowelized forms of a non-vowelized entry. The primary system was trained on about 150 hours of audio data and almost 600 million words of Arabic texts. A contrast system, trained only on resources distributed by the LDC, was also submitted. The word error rates of the primary system were 16.0% and 18.5% on the dev04 and eval04 data, and the respective word error rates for the contrast system were 17.6% and 20.2%.

1. INTRODUCTION

This paper describes some recent work improving our broadcast news transcription system for Modern Standard Arabic as described in [10]. By Modern Standard Arabic we refer to the spoken version of the official written language, which is spoken in much of the Middle East and North Africa, and is used in major broadcast news shows.

At LIMSI we have found that porting a broadcast news system developed for American English to several other languages was quite straightforward if the required resources are available. Our observation is that given a similar quantity and quality of linguistic resources (audio data, language model training texts, and a consistent pronunciation lexicon), somewhat comparable recognition accuracies can be obtained in different languages [7].

The Arabic language poses challenges somewhat different from the other languages (mostly Indo-European Germanic or Romance) we have worked with.
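The word classes mentioned in the abstract group all vowelized forms of one non-vowelized entry. The Python below is a minimal sketch of that grouping and not the system's actual implementation. It assumes Unicode input and treats the Arabic marks U+064B to U+0652 (short vowels, nunation, shadda, sukun) as the vowelization to strip. The function names `strip_vowelization` and `build_word_classes` and the example forms are hypothetical.

```python
from collections import defaultdict

# Assumption: vowelization is carried by the Arabic combining marks
# U+064B..U+0652 (nunation, short vowels, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_vowelization(word: str) -> str:
    """Return the non-vowelized written form of a vowelized word."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def build_word_classes(vowelized_forms):
    """Map each non-vowelized entry to the sorted list of its vowelized forms."""
    classes = defaultdict(set)
    for form in vowelized_forms:
        classes[strip_vowelization(form)].add(form)
    return {entry: sorted(forms) for entry, forms in classes.items()}


# Illustrative forms only (written with escapes to avoid bidi rendering issues):
# kataba, kutiba, kutub share the written form k-t-b; dars has its own class.
forms = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",
    "\u0643\u064F\u062A\u0650\u0628\u064E",
    "\u0643\u064F\u062A\u064F\u0628",
    "\u062F\u064E\u0631\u0652\u0633",
]
word_classes = build_word_classes(forms)
```

In this sketch a language model would be estimated over the dictionary keys (the non-vowelized entries), while the pronunciation lexicon keeps the individual vowelized forms listed under each key.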
Modern Standard Arabic is that which is learned in school, used in most newspapers and is considered to be the official language in most Arabic speaking countries. In contrast, many people speak in dialects for which there is only a spoken form and no recognized written form. Arabic texts are written and read from right to left and the vowels are generally not indicated. It is a strongly consonantal language with nominally only three vowels, each of which has a long and short form. Arabic is a highly inflected language, and as a result has many different word forms for a given root, produced by appending articles at the word beginning (“the, and, to, from, with, ...”) and possessives (“ours, theirs, ...”) at the word end.

The right-to-left nature of Arabic texts required modifications to the text processing utilities. The texts are non-vowelized, meaning the short vowels and gemination are not indicated. There are typically several possible (generally semantically linked) vowelizations for a given written word, and the word-final vowel varies as a function of the word context. For most written texts it is necessary to understand the text in order to know how to vowelize and pronounce it correctly.

*† Visiting scientist from the Vecsys Company.

2. ARABIC LANGUAGE RESOURCES

The audio corpus contains about 150 hours of radio and television broadcast news data from a variety of sources including VOA, NTV from the TDT4 corpus, Cairo Radio from FBIS (recorded in 2000 and 2001 and distributed by the LDC), and Radio Elsharq (Syria), Radio Kuwait, Radio Orient (Paris), Radio Qatar, Radio Syria, BBC, Medi1, Aljazeera (Qatar), TV Syria, TV7, and ESC [10].

For the 70 hours of TDT4 and FBIS data, we used time-aligned segmented transcripts, shared with us by BBN, which had been derived from the associated closed-captions and commercial transcripts.
These transcripts are not vowelized, as is typically the case for Arabic texts, and have about 520k words (45k distinct forms).

The remaining audio data were collected during the period from September 1999 through October 2000, and from April 2001 through the end of 2002 [10]. These data were manually transcribed using an Arabic version of Transcriber [1] and an Arabic keyboard. The manual transcriptions are vowelized, enabling accurate modeling of the short vowels, even though these are not usually present in written texts. This is different from the approach taken by Billa et al. [2] where only characters in the non-vowelized written form are modeled. Each Arabic character, including short vowel and geminate markers, is transliterated to a single ASCII character. Transcription conventions were developed to provide guidance for marking vowels and dealing with inflections