Creating a Methodology for Large-Scale Correction of Treebank Annotation: The Case of the Arabic Treebank

Mohamed Maamouri, Ann Bies and Seth Kulick
Linguistic Data Consortium, University of Pennsylvania
3600 Market Street, Suite 810, Philadelphia, PA 19104 USA
{maamouri,bies,skulick}@ldc.upenn.edu

Abstract

The LDC Arabic Treebank team has significantly revised and enhanced its annotation guidelines and annotation procedures over the last two years, with the goal of reducing inconsistency in annotation in the Treebank. We have now completed automatic and significant manual revisions to 738,845 tokens/words in total, bringing them into line as far as possible with the new annotation guidelines and greatly improving the annotation consistency. We created a methodology for large-scale correction of Treebank annotation during the course of this revision process, balancing the need for consistency with tight time constraints for correcting and updating a large amount of data annotated according to previous guidelines. The combination and interleaving of automatic and manual corrections were crucial to the success of the overall revision. We also demonstrate the success of the revision by reporting on an improvement in parsing results.

Introduction

The LDC Arabic Treebank team has significantly revised and enhanced its annotation guidelines and annotation procedures over the last two years, with the goal of reducing inconsistency in annotation in the Treebank. We have now completed automatic and significant manual revisions to all of ATB1¹, ATB2² and ATB3³ (738,845 tokens/words in total), bringing them into line as far as possible with the new annotation guidelines⁴ and greatly improving the annotation consistency.
We created a methodology for large-scale correction of Treebank annotation during the course of this revision process, balancing the need for consistency with tight time constraints for correcting and updating a large amount of data annotated according to previous guidelines. The combination and interleaving of automatic and manual corrections were crucial to the success of the overall revision. This paper describes the correction process, the scope of correction that can be done in this way, and the type of correction that cannot. We also demonstrate the success of the revision by reporting on an improvement in parsing results.

The Arabic Treebank

The Penn Arabic Treebank (ATB) began in the fall of 2001 (Maamouri and Cieri, 2002) and in five years has completed numerous full releases of morphologically and syntactically annotated data⁵. The ATB corpora are annotated for morphological information, part-of-speech, English gloss (all in the “part-of-speech” or “POS” phase of annotation), and for syntactic structure (similar to Treebank II style, Marcus et al., 1993; Marcus et al., 1994; Bies et al., 1995).

¹ LDC2008E61 - Arabic Treebank Part 1 v 4.0
² LDC2008E62 - Arabic Treebank Part 2 v 3.0
³ LDC2008E22 - Arabic Treebank Part 3 v 3.1
⁴ http://projects.ldc.upenn.edu/ArabicTreebank/
⁵ Work on additional corpora, both MSA and dialectal, is ongoing.

In addition to the usual issues involved with the complex annotation of data, we have come to terms with a number of issues that are specific to a highly inflected language with a rich history of traditional grammar. In designing our annotation system for Arabic, we relied on traditional Arabic grammar, previous grammatical theories of Modern Standard Arabic and modern approaches, and especially the Penn Treebank approach to syntactic annotation, which we believe can be generalized to the development of annotation systems for other languages (Maamouri and Bies, 2004).
We also benefited from LDC's rich experience in linguistic annotation. We innovated with respect to traditional grammar when necessary and when we were sure that other syntactic approaches accounted for the data. Our goal is for the Arabic Treebank to be of high quality, to have a high level of descriptive consistency, and to be credible both with regard to the attitudes toward, and respect for, correctness known to be present in the Arab region and with respect to the NLP and wider linguistic communities.

Maamouri and Bies (2004) give a comprehensive description of ‘Modern Standard Arabic’ (MSA) as the language mostly targeted by Arabic NLP research. The Penn Arabic Treebank has therefore so far focused primarily on Arabic newswire text.

This paper does not address the question of diacritization directly; for a complete discussion of vocalization in the Arabic Treebank, see Maamouri, Kulick and Bies (2008). Syntactic clitics affecting the tree were separated after POS tagging and prior to Treebanking, resulting in an increase in the number of tokens in the Treebank data.
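
The effect of clitic separation on token counts can be illustrated with a minimal sketch. It assumes, purely for illustration, that clitic boundaries are already marked with "+" in a transliterated form. The function name split_clitics, the "+" convention, and the sample words are hypothetical and are not taken from the ATB tools or guidelines.

```python
from typing import List


def split_clitics(words: List[str], boundary: str = "+") -> List[str]:
    """Split each source word at marked clitic boundaries.

    Assumption: boundaries are pre-marked with `boundary`. A real pipeline
    would derive them from the morphological (POS) annotation instead.
    """
    tokens: List[str] = []
    for word in words:
        # Drop empty pieces so stray or doubled markers do not create tokens.
        tokens.extend(piece for piece in word.split(boundary) if piece)
    return tokens


# Hypothetical, simplified transliterated input: two source words,
# the first carrying a proclitic and an enclitic.
source_words = ["wa+kitAb+hu", "jadiyd"]
tree_tokens = split_clitics(source_words)

print(len(source_words), "source words ->", len(tree_tokens), "tree tokens")
```

In this toy input, two source words yield four tree tokens, which is the kind of increase the paragraph above describes. Which clitics count as "affecting the tree" is decided by the annotation guidelines and is not modeled here.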