Relational Evaluation Schemes

Ted Briscoe, John Carroll, Jonathan Graham, Ann Copestake

Computer Laboratory, University of Cambridge
{Ted.Briscoe, Ann.Copestake}@cl.cam.ac.uk

Cognitive and Computing Sciences, University of Sussex
John.Carroll@cogs.susx.ac.uk

Abstract

We describe extensions to a scheme for evaluating parse selection accuracy based on named grammatical relations between lemmatised lexical heads. The scheme is intended to reflect directly the task of recovering grammatical and logical relations, rather than more arbitrary details of tree topology. A manually annotated test suite of 500 sentences has been used by several groups to perform evaluations. We are developing software to create larger test suites automatically from existing treebanks, and we are considering alternative relational annotations which draw a clearer distinction between grammatical and logical relations in order to overcome limitations of the current proposal.

1. Introduction

We have developed a scheme for evaluating parse selection accuracy based on named grammatical relations between lemmatised lexical heads. The scheme is intended to reflect directly the task of recovering semantic relations, rather than more arbitrary details of tree topology, unlike the PARSEVAL scheme, which has been criticised frequently for the opaque relationship between its measures and such relations (Carroll et al., 1998; Magerman, 1995; Srinivas, 1997). Carroll et al. (1998) provide more detailed motivation and a comparison with other extant schemes. Carroll et al. (1999; 2002, in press) report the development of a test suite of 500 sentences annotated with grammatical relations, the specification of the relations, and their criteria of application. The set of named relations is organised as a subsumption hierarchy in which, for example, subj(ect) underspecifies n(on)c(lausal)subj(ect).
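The underspecification relationship can be pictured as walking parent links in the hierarchy. The sketch below is illustrative only: the parent links are a partial reconstruction from relations mentioned in the text (the authoritative hierarchy is the one given in Figure 1), and `subsumes` is a hypothetical helper, not part of the authors' software.

```python
# Partial, reconstructed parent links for the GR subsumption hierarchy.
# Illustrative only; see Figure 1 for the full, authoritative hierarchy.
PARENT = {
    "mod": "dependent",
    "arg_mod": "dependent",
    "arg": "dependent",
    "subj": "arg",
    "comp": "arg",
    "ncsubj": "subj",
    "obj": "comp",
    "clausal": "comp",
    "dobj": "obj",
    "xcomp": "clausal",
    "cmod": "mod",
    "ncmod": "mod",
    "detmod": "mod",
}

def subsumes(general: str, specific: str) -> bool:
    """True if `general` equals or underspecifies (is an ancestor of) `specific`."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False
```

For instance, `subsumes("subj", "ncsubj")` holds, matching the example in the text, while `subsumes("mod", "dobj")` does not.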
There are 15 fully specified relations in total; however, many of these can be further subclassified. For example, subj relations have an initial_gr slot used to encode that the syntactic subject is the logical object (as in the passive), and to mark other non-canonical subjects (such as in locative inversion). Thus a fully specified GR might look like (ncsubj marry couple obj), encoding the subj relation in The couple were married in August, and the GR annotation of each sentence of the test suite consists of a set of such GR n-tuples. Figure 1 gives the full set of named relations represented as a subsumption hierarchy. The most generic relation between a head and a dependent is dependent. Where the relationship between the two is known more precisely, relations further down the hierarchy can be used, for example mod(ifier) or arg(ument). Relations mod, arg_mod, aux, clausal, and their descendants have slots filled by a type, a head, and its dependent; arg_mod has an additional fourth slot, initial_gr. Descendants of subj, and also dobj, have the three slots head, dependent, and initial_gr. Relation conj has a type slot and one or more head slots. The x and c prefixes to relation names differentiate clausal control alternatives.

When the proprietor dies, the establishment should become a corporation until it is either acquired by another proprietor or the government decides to drop it.

(ncsubj die proprietor _)
(ncsubj become establishment _)
(xcomp _ become corporation)
(ncsubj acquire it obj)
(arg_mod by acquire proprietor subj)
(ncmod _ acquire either)
(ncsubj decide government _)
(xcomp to decide drop)
(ncsubj drop government _)
(dobj drop it _)
(cmod when become die)
(cmod until become acquire)
(cmod until become decide)
(detmod _ proprietor the)
(detmod _ establishment the)
(detmod _ corporation a)
(detmod _ proprietor another)
(detmod _ government the)
(aux _ become shall)
(aux _ acquire be)
(conj or acquire decide)

Figure 2: Grammatical relation sample annotation.
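Because each annotation is just a set of n-tuples in this textual notation, a sentence's GRs map directly onto a set of tuples. A minimal sketch, with a hypothetical `parse_gr` helper (not part of the authors' software) applied to tuples taken from the examples above:

```python
def parse_gr(line: str) -> tuple:
    """Parse a textual GR such as '(ncsubj marry couple obj)' into a tuple.
    '_' marks an unfilled slot, exactly as in the textual notation."""
    return tuple(line.strip("()").split())

# A fragment of a sentence annotation as a set of GR tuples.
gold = {
    parse_gr("(ncsubj die proprietor _)"),
    parse_gr("(ncsubj acquire it obj)"),   # passive: surface subject is logical object
    parse_gr("(xcomp _ become corporation)"),
}
```

Set membership then gives exact GR matching: `("ncsubj", "acquire", "it", "obj")` is in `gold`, while the same tuple with an unfilled initial_gr slot is not.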
Figure 2 shows the GR encoding of a sentence from the Susanne corpus. The evaluation metric uses the standard precision, recall, and F measures over sets of such GRs. Carroll and Briscoe (2001) also make use of weighted recall and precision (as implemented in the PARSEVAL software) to evaluate systems capable of returning n-best sets of weighted GRs. The software provides both averaged scores over all relations and scores by named relation. It also supports partial scoring in terms of non-leaf named relations which underspecify leaf relations. The current specification of the
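The unweighted measures over GR sets can be sketched as follows. This is a minimal illustration of precision, recall, and F over sets of exact-match GR tuples, not the PARSEVAL-derived software itself, and the example tuples are drawn from Figure 2.

```python
def prf(gold: set, test: set) -> tuple:
    """Precision, recall, and F1 over sets of GR tuples (exact match)."""
    correct = len(gold & test)
    precision = correct / len(test) if test else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)) if correct else 0.0
    return precision, recall, f

gold = {("ncsubj", "die", "proprietor", "_"),
        ("dobj", "drop", "it", "_"),
        ("aux", "_", "become", "shall")}
test = {("ncsubj", "die", "proprietor", "_"),
        ("dobj", "drop", "it", "_"),
        ("ncmod", "_", "drop", "it")}   # one spurious, one missed GR

p, r, f = prf(gold, test)   # each is 2/3 here
```

Partial scoring against non-leaf relations (e.g. crediting a subj match for an ncsubj in the gold set) would relax the exact-match comparison using the subsumption hierarchy; that refinement is omitted here.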