Data, Annotations and Measures in EASY, the Evaluation Campaign for Parsers of French

Patrick Paroubek*, Isabelle Robba*, Anne Vilnat*, Christelle Ayache†

* LIMSI-CNRS, bât. 508, Université Paris XI, 91403 Orsay Cedex, France
{pap,isabelle.robba,anne.vilnat}@limsi.fr
† ELDA, 55-57 rue Brillat-Savarin, 75013 Paris, France
ayache@elda.fr

Abstract
This paper presents the protocol of EASY, the evaluation campaign for syntactic parsers of French run within the EVALDA project of the TECHNOLANGUE program. We describe the participants, the corpus and its genre partitioning, the annotation scheme, which allows for the annotation of both constituents and relations, the evaluation methodology and, as an illustration, the results obtained by one participant on half of the corpus.

1. Introduction
EASY is one of the eight language-technology evaluation campaigns of EVALDA, a project of the French national TECHNOLANGUE program(1). The aim of the EASY campaign (Vilnat et al., 2004; Paroubek et al., 2005) is to design and test an evaluation methodology for comparing parsers of French, and to produce a treebank by automatically combining all the data annotated by the participants. The corpus consists of texts taken from various domains (literature, medicine, technical texts, etc.) and of different genres (newspapers, questions, websites, oral transcriptions, etc.). EASY is a complete evaluation protocol, covering corpus constitution, manual corpus annotation, evaluation, and production of the treebank. In this paper, we describe the corpus and its genre partitioning, the annotation scheme, which allows for the annotation of both constituents and relations, the evaluation methodology and, as an illustration, the results of one of the 16 systems participating in the campaign on half of the corpus, since at the time of writing not all the results had been computed (all results will be presented at the conference).

(1) TECHNOLANGUE (December 2002 to April 2006) is supported by three French ministries: Culture, Industry and Research.

2. State of the art
In the early days, parsing evaluation was done by experts who formed their opinion from the observation of parses. In many cases, they used a grid of parsing features (Blache and Morin, 2003) to guide their analysis. Concerning the parsing of French, it seems that the first attempt at comparative evaluation dates back to (Abeillé, 1991). In an attempt to reduce the subjectivity introduced by the particular views that experts might entertain about particular approaches, and to improve the reuse of linguistic knowledge, people started to employ specific test suites, of which TSNLP is a good example (Oepen et al., 1996). But test suites do not reflect the distribution of the phenomena encountered in real corpora, since they generally contain a limited number of examples without statistical information. Furthermore, they can only be reused for non-regression tests, because once a suite has been used, it is relatively simple to adapt one's parser to the specific language items it contains. Finally, they often require a mapping of syntactic annotations, since there is a good chance that the test suite encodes the syntactic information in a formalism different from the one used by the parser, and in general such a mapping induces an information loss or is complex to perform.
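As a hypothetical illustration of this information loss (the labels and the mapping below are invented for the example), projecting a parser's fine-grained constituent labels onto a coarser test-suite inventory is typically a many-to-one operation, so the original distinctions cannot be recovered afterwards:

# Hypothetical many-to-one mapping from a parser's fine-grained
# constituent labels onto a coarser test-suite inventory.
PARSER_TO_SUITE = {
    "NP-SBJ": "NP",   # subject noun phrase
    "NP-OBJ": "NP",   # object noun phrase
    "PP-TMP": "PP",   # temporal prepositional phrase
    "PP-LOC": "PP",   # locative prepositional phrase
}

def map_label(label: str) -> str:
    """Project a parser label onto the test-suite formalism."""
    return PARSER_TO_SUITE.get(label, "UNKNOWN")

# The mapping is not invertible: once "NP-SBJ" and "NP-OBJ" are both
# rendered as "NP", the grammatical-function distinction is lost,
# which is the information loss mentioned above.
assert map_label("NP-SBJ") == map_label("NP-OBJ") == "NP"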
With the advances in computer technology and markup standards, a new solution emerged to overcome these drawbacks: treebanks. The first and certainly the most famous is the Penn Treebank (Marcus et al., 1993), which was followed by developments for many other languages (Brants et al., 2002), including French (Abeillé et al., 2000). Since 2002, Palmer et al. (2005) have proposed adding semantic role labels to the Penn Treebank. Then in 2004, Miltsakaki et al. (2004) proposed a large-scale discourse annotation project, the Penn Discourse Treebank, which aims at identifying discourse connectives and their arguments. Although treebanks can solve the problems of language coverage and of representing the distribution of linguistic phenomena, provided they are large enough and their genres are representative of the material parsed, they do not provide a solution for easily finding an appropriate pivot formalism when the one used by the parser under test and the one used by the treebank are different. To be faithful, an evaluation must preserve both the information present in the reference data and the information output by the parsers. Devising a universal syntactic formalism that enables the description of all linguistic phenomena generally encountered is precisely one of the research objectives of parsing. Many proposals have been made: some use annotation mappings (Gaizauskas et al., 1998), others compare amounts of information (Musillo and Sima’an, 2002), which unfortunately requires the building of one parallel corpus per formalism.
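To make the preceding discussion concrete, the following minimal sketch (in Python, with an invented sentence and simplified labels echoing the campaign's French group and relation names; the actual EASY format differs in its details) shows the kind of two-layer annotation, constituents over word spans plus labeled relations between them, that such an evaluation has to compare, and how set-based precision and recall can be computed over either layer:

from dataclasses import dataclass

@dataclass(frozen=True)
class Constituent:
    label: str         # e.g. a nominal or verbal group
    span: tuple        # (start, end) word indices, end exclusive

@dataclass(frozen=True)
class Relation:
    kind: str          # e.g. a subject-verb dependency
    source: Constituent
    target: Constituent

def precision_recall(hypothesis: set, reference: set):
    """Set-based precision/recall, applicable to either annotation layer."""
    correct = len(hypothesis & reference)
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall

# "Le chat dort" -- invented reference and hypothesis annotations.
gn = Constituent("GN", (0, 2))   # "Le chat", nominal group
nv = Constituent("NV", (2, 3))   # "dort", verbal nucleus
reference = {Relation("SUJ-V", gn, nv)}
hypothesis = {Relation("SUJ-V", gn, nv)}
print(precision_recall(hypothesis, reference))  # (1.0, 1.0)

Keeping constituents and relations in two separate layers, as above, lets an evaluation score each layer independently, which matters when a parser is reliable on grouping but weaker on functional relations, or vice versa.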