Dialectal Arabic Telephone Speech Corpus: Principles, Tool Design, and Transcription Conventions

Mohamed Maamouri, Tim Buckwalter, Christopher Cieri
Linguistic Data Consortium, University of Pennsylvania
maamouri@ldc.upenn.edu, timbuck2@ldc.upenn.edu, ccieri@ldc.upenn.edu

Abstract

This paper presents the experience gained at LDC in the collection and transcription of a corpus of conversational telephone speech in dialectal Arabic. The paper covers the following: (a) Arabic language background; (b) objectives, principles, and methodological choices of dialectal Arabic transcription; (c) conceptualization and design features of LDC’s ‘Arabic Multi-Dialectal Transcription Tool’ (AMADAT); and (d) a brief description of the conversational Levantine Arabic transcription guidelines and annotation conventions.

1.0 Introduction: Arabic Linguistic Background

The Arabic language is a ‘linguistic continuum’ (Hymes, 1973) with two major poles: an Arabic Standard, the language of most written and formal spoken discourse, and a collection of related Arabic dialects, which are mainly spoken and which differ significantly in phonology, morphology, syntax, and lexicon, both among themselves and when compared to the standard written forms. This situation, usually referred to as ‘diglossia’ (Ferguson, 1959), presents challenging issues for Arabic spoken language technologies, including corpus creation to support Speech-to-Text (STT) systems, since the spoken Arabic dialects are not officially written and have no standardized orthography, despite their growing, though still relatively small and not wholly conventionalized, use on the web. A significant amount of linguistic variation occurs and produces many variant forms which are difficult to identify and regroup.
1.1 Arabic Dialectal Variation

The diglossic situation described above mainly reflects a significant linguistic distance between all Arabic dialects and the ‘fusha,’ commonly identified as ‘Modern Standard Arabic’ or MSA, though the latter term does not cover all features of the former. This linguistic distance is characterized by substantial variation, mostly phonological, morphological, and lexical. Arabic dialectal variation is significant not only between major dialects (for example, Egyptian, Levantine, Gulf, Maghreb) but also between the regional variants of a major dialect (for example, Northern and Southern Levantine).

Sound change has occurred in all Arabic dialects. In Levantine Arabic (LA), for instance, the sound /q/ is pronounced /q/ but also /?/ (the glottal stop), /g/ and /k/. The glottal stop is mostly deleted in medial and word-final position, with compensatory lengthening of the preceding vowel word-internally (ra?s ‘head’ becomes ra:s and bi?r ‘a well’ becomes bi:r). Moreover, interesting cases of chain shifts with counter-feeding rule interactions also occur: MSA fa?r ‘mouse’ becomes dialectal fa:r, while MSA faqr ‘poverty’ remains faqr but also becomes fa?r, so that the form fa?r, which earlier meant ‘mouse’, now means ‘poverty’. An important consequence of chain shifts is the multiplication of lexical ambiguity in the language.

The complexity of this situation is compounded by significant differences between the sound changes of the various Arabic dialects. In Egyptian Arabic, for instance, MSA /θ/ becomes both /t/ and /s/, while /g/ replaces /j/ and /?/ replaces /q/. In Sudanese Arabic, MSA /q/ is replaced by /g/ and the uvular [ɣ]. All of the above creates a considerable amount of confusion, which needs to be addressed and taken into account in any dialectal transcription task.
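The counter-feeding interaction described in Section 1.1 can be illustrated with a minimal sketch. It is only a toy model under stated assumptions: the function names, the two-rule inventory, and the restriction to the short vowels a, i, u are illustrative choices, not part of the corpus tools or guidelines. As in the examples above, "?" stands for the glottal stop and ":" for vowel length.

```python
import re

# Toy transcription (assumed for illustration): "?" = glottal stop, ":" = length.

def lengthen_before_glottal_stop(form: str) -> str:
    """Delete "?" after a short vowel when a consonant or the word end follows,
    lengthening that vowel (ra?s -> ra:s)."""
    return re.sub(r"([aiu])\?(?=[^aiu]|$)", r"\1:", form)

def q_to_glottal_stop(form: str) -> str:
    """Replace /q/ with the glottal stop (faqr -> fa?r)."""
    return form.replace("q", "?")

def counter_feeding(form: str) -> str:
    """Deletion applies first, so a "?" created later from /q/ survives."""
    return q_to_glottal_stop(lengthen_before_glottal_stop(form))

def feeding(form: str) -> str:
    """Hypothetical opposite order, shown for contrast only."""
    return lengthen_before_glottal_stop(q_to_glottal_stop(form))

if __name__ == "__main__":
    for msa in ["ra?s", "bi?r", "fa?r", "faqr"]:
        print(msa, "->", counter_feeding(msa), "| feeding order:", feeding(msa))
```

In the counter-feeding order, fa?r ‘mouse’ and faqr ‘poverty’ remain distinct (fa:r versus fa?r), but the dialectal output fa?r coincides with the MSA form for ‘mouse’, which is the source of ambiguity noted above. In the contrasting feeding order both words would merge as fa:r. The sketch models only the shifted variant of faqr, not the variant in which /q/ is retained.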
1.2 Pertinent Linguistic Features and the Dialectal Arabic Transcription Challenge

The description of Arabic dialect differences above, which does not even consider linguistic variation conditioned by age, gender, urbanity, rurality or style, shows the complexity of any speech-to-text (STT) transcription task. It also predicts the challenges facing any linguistic transcription methodology that seeks to closely represent sound features without capturing the distinctions that matter to native speakers.

In building a conversational Levantine Arabic corpus, a Romanized orthography-based transcription can bypass the issues of phonemic sound shifts and the resulting variation by, for example, giving a faithful rendering of Levantine pronunciation characteristics. However, such a Romanized transcription would be machine readable and usable only for, and within the framework of, a single dialect system: LA. A Romanized transcription output will necessarily lead to the following tasks: (a) a long LA-related disambiguation process, (b) a comprehensive LA-specific lexicon and grammar, and (c) significantly longer annotator training periods for better familiarization with transcription symbols.

Looking for examples of speech-to-text transcription practices that have been used successfully to support speech technologies (not just among linguists), one may ponder the wisdom of an orthography designed to write different spoken dialects (or different variants of one of them) more similarly than they sound, roughly as English orthography does for world Englishes. This idea may seem far-fetched, but the Arabic language continuum is similar in many ways to the English one and presents the following potentially useful features: (a) there exists an important core of mutual intelligibility between MSA and the dialects, (b) there is a high level of similarity in morphological form and