Automatic Extraction and Evaluation of Arabic LFG Resources

Mohammed Attia†‡, Khaled Shaalan‡, Lamia Tounsi†, and Josef van Genabith†
†School of Computing, Dublin City University, Ireland
‡The British University in Dubai, UAE
E-mail: {mattia,ltounsi,josef}@computing.dcu.ie, khaled.shaalan@buid.ac.ae

Abstract

This paper presents the results of an approach to automatically acquire large-scale, probabilistic Lexical-Functional Grammar (LFG) resources for Arabic from the Penn Arabic Treebank (ATB). Our starting point is the earlier work of Tounsi et al. (2009) on automatic LFG f(eature)-structure annotation for Arabic using the ATB. They exploit tree configuration, POS categories, functional tags, local heads and trace information to annotate nodes with LFG feature-structure equations. We utilize this annotation to automatically acquire grammatical function (dependency) based subcategorization frames and paths linking long-distance dependencies (LDDs). Many state-of-the-art treebank-based probabilistic parsing approaches are scalable and robust but often also shallow: they do not capture LDDs and represent only local information. Subcategorization frames and LDD paths can be used to recover LDDs from such parser output to capture deep linguistic information. Automatic acquisition of language resources from existing treebanks saves the time and effort involved in creating such resources by hand. Moreover, data-driven automatic acquisition naturally associates probabilistic information with subcategorization frames and LDD paths. Finally, based on the statistical distribution of LDD path types, we propose empirical bounds on the traditional regular-expression-based functional uncertainty equations used to handle LDDs in LFG.

Keywords: Arabic subcategorization frames, Arabic long-distance dependencies, Arabic LFG annotation

1. Introduction

The automatic extraction of LFG language resources from treebanks has been described for many languages, including English (Cahill et al., 2004), German (Rehbein and van Genabith, 2009), French (Schluter and van Genabith, 2008) and Chinese (Guo et al., 2007). Here we present our research on extracting similar LFG language resources for Arabic from the Penn Arabic Treebank (ATB) (Maamouri and Bies, 2004), which contains 22,524 sentences, 787,235 tokens, and 587,665 words. These language resources consist mainly of two distinct and complementary parts: subcategorization frames and long-distance dependency (LDD) paths. Subcategorization frames describe the argument structure requirements of predicates, or semantic forms, while LDD paths describe the grammatical functions that lie on the path between two co-indexed syntactic elements. These two language resources can be used to augment the output of a probabilistic treebank-based parser with deeper syntactic information, including unbounded dependencies (Cahill et al., 2004, 2008), that is not captured by many current statistical parsing approaches (Bikel, 2004; Petrov et al., 2006).

Although this method has been implemented for a number of languages, Arabic, with its rich morphology and relatively free word order, presents particular challenges, which we address in this paper. For instance, the extraction of subcategorization frames requires lemmatizing Arabic surface forms, which is an intricate problem. Regarding the extraction of LDD paths, we discuss grammatical phenomena of particular interest in Arabic, such as fronted subjects and resumptive pronouns, the latter marking the lower end of an LDD relationship. We also estimate the maximum length of an LDD path: relying on the probability distribution over LDD paths, we propose empirical upper bounds on path length.
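As a rough illustration of how a probability distribution over LDD paths can yield an empirical upper bound on path length, consider the following minimal sketch. It is not the extraction procedure used in this paper: the function names, the representation of a path as a tuple of grammatical function labels, the coverage threshold, and the toy counts are all assumptions introduced here for illustration only.

```python
from collections import Counter


def path_probabilities(paths):
    """Relative-frequency estimate of P(path) over observed LDD paths.

    Each path is a tuple of grammatical function labels, e.g. ("comp", "obj").
    This representation is an assumption made for this sketch.
    """
    counts = Counter(paths)
    total = sum(counts.values())
    return {path: count / total for path, count in counts.items()}


def empirical_length_bound(paths, coverage=0.99):
    """Smallest length L such that at least `coverage` of the observed
    path tokens have length <= L."""
    length_counts = Counter(len(path) for path in paths)
    total = sum(length_counts.values())
    seen = 0
    for length in sorted(length_counts):
        seen += length_counts[length]
        if seen / total >= coverage:
            return length
    raise ValueError("cannot compute a bound from an empty list of paths")


# Invented toy data, NOT counts from the ATB.
toy_paths = (
    [("subj",)] * 6
    + [("obj",)] * 2
    + [("comp", "obj")]
    + [("comp", "comp", "obj")]
)

probs = path_probabilities(toy_paths)
bound = empirical_length_bound(toy_paths, coverage=0.85)
```

With a bound of this kind, an unbounded functional uncertainty expression could in principle be replaced by one whose repetition is capped at the chosen length; the choice of coverage threshold is a design decision and is left open here.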
Our annotation, subcategorization frames and LDD resources are based on the formalism of Lexical-Functional Grammar (LFG) (Dalrymple, 2001). LFG is a constraint-based, non-derivational syntactic framework that distinguishes between two distinct but related levels of representation: c(onstituent)-structure and f(unctional)-structure. C-structure takes the form of phrase structure trees, while f-structure is represented in terms of attribute-value structures, or attribute-value matrices (AVMs). F-structure is not directly derived from c-structure; rather, the two levels are related through f-equations annotated on CFG tree nodes (Austin, 2001).

2. Arabic Subcategorization Frames

The subcategorization requirements of lexical entries are an important type of lexical information, as they indicate the argument(s) a predicate needs in order to form a well-formed syntactic structure. Producing such resources by hand is costly and time-consuming. In the current research we create a lexicon of 1947