Lexical Profiling for Arabic

Mohammed Attia, Pavel Pecina, Lamia Tounsi, Antonio Toral and Josef van Genabith
School of Computing
Dublin City University, Dublin, Ireland
E-mail: {mattia, ppecina, atoral, ltounsi, josef}@computing.dcu.ie

Abstract

We provide lexical profiling for Arabic by covering two important linguistic aspects of Arabic lexical information, namely morphological inflectional paradigms and syntactic subcategorization frames, making our database a rich repository of Arabic lexicographic details. First, we provide a complete description of the inflectional behaviour of Arabic lemmas based on statistical distribution. We use a corpus of 1,089,111,204 words, a pre-annotation tool, knowledge-based rules, and machine learning techniques to automatically acquire lexical knowledge about words’ morpho-syntactic attributes and inflection possibilities. Second, we automatically extract the Arabic subcategorization frames (or predicate-argument structures) from the Penn Arabic Treebank (ATB) for a large number of Arabic lemmas, including verbs, nouns and adjectives. We compare the results against a manually constructed collection of subcategorization frames designed for an Arabic LFG parser. The comparison shows that we achieve high precision scores for all three word classes. The morphological and syntactic specifications are combined and connected in a scalable and interoperable lexical database suitable for constructing a morphological analyser, aiding a syntactic parser, or even building an Arabic dictionary. We build a web application, AraComLex (Arabic Computer Lexicon), available at http://www.cngl.ie/aracomlex, for managing and maintaining the standardized and scalable lexical database.

Keywords: Arabic; subcategorization frames; morphological analysis; morphological paradigms

1. Introduction

A typical dictionary entry is expected to give basic information about a word’s morphology (its possible inflections) and syntax (its part of speech; for verbs, whether it is transitive or intransitive; and which prepositions it can co-occur with). Yet existing Arabic dictionaries have several limitations. Most of them do not rely on a corpus to attest the validity of their entries (as in the COBUILD approach (Sinclair, 1987)); instead, they typically offer refinements, expansions, corrections, or organisational improvements over previous dictionaries. As a result, they tend to include obsolete words that are no longer in contemporary use. Furthermore, they often do not explicitly state all the possible inflection paradigms, and they do not provide sufficient syntactic information on a word’s obligatory combinations (its argument list). Our aim is to address these shortcomings by automatically providing a complete description of the inflectional and syntactic behaviour of Arabic lexical entries, based on their statistical distribution in treebanks and un-annotated corpora.

The work described in this paper is divided into two major parts. The first examines the statistical distribution of inflection paradigms for lexical entries in a large corpus pre-annotated with MADA (Roth et al., 2008), a tool which performs morphological analysis and disambiguation using the Buckwalter morphological analyser (Buckwalter, 2004) and machine learning. The second concerns the automatic extraction of syntactic information, or subcategorization frames, from the Arabic Treebank (ATB) (Maamouri and Bies, 2004). To the best of our knowledge, this is the first attempt at extracting subcategorization frames from the ATB.

The subcategorization requirements of lexical entries are an important type of lexical information, as they indicate the argument(s) a predicate needs in order to form a well-formed syntactic structure.
Yet producing such resources by hand is costly and time-consuming. Moreover, as Manning (1993) indicates, dictionaries produced by hand tend to lag behind real language use because of their static nature. Therefore, a complete, or at least complementary, automatic process is highly desirable.

This paper is structured as follows. In the introduction we describe the motivation behind our work. We differentiate between Modern Standard Arabic (MSA), the focus of this research, and Classical Arabic (CA), which is a historical version of the language. We briefly explain the current state of Arabic lexicography and describe how outdated words are still abundant in current dictionaries. We then outline the Arabic morphological system to show which layers and tiers are involved in word derivation and inflection. In Section 2, we present the results obtained to date in building and extending the lexical database using a data-driven filtering method and machine learning techniques. We also explain how we use knowledge-based pattern matching to detect and extract broken plural forms. In Section 3, we explain the method we followed in extracting and evaluating the subcategorization frames for Arabic verbs, nouns and adjectives. In Section 4, we describe AraComLex, a web application we built for curating and combining our lexical resources. Finally, Section 5 gives the conclusion.

1.1 Modern Standard Arabic vs. Classical Arabic

Modern Standard Arabic (MSA), the subject of our research, is the language of modern writing, prepared speeches, and the news. It is the language universally understood by Arabic speakers

Proceedings of eLex 2011, pp. 23-33