PropBank as a Bootstrap for Richer Annotation Schemes

Paul Kingsbury, Benjamin Snyder, Nianwen Xue, and Martha Palmer

University of Pennsylvania
Department of Computer and Information Science
{kingsbur, bsnyder3, xueniwen, palmer}@unagi.cis.upenn.edu

Abstract

The success of interlingual annotation depends crucially on agreement as to the entities to be annotated, both in terms of the categories of entities and in terms of the names of the entities. Two independent annotation efforts at Penn resulted in annotations which had a comfortable amount of overlap but which still disagreed substantially on even the categories to be marked. The English annotation took a very rich approach, marking a great many relationships which were not explicitly mentioned in the conference guidelines but which seemed logical and necessary to avoid losing important information. The Chinese annotation, in contrast, took a conservative approach, more in line with the conference guidelines, and therefore marked far fewer entities. This mismatch can be attributed to the lack of specific guidelines for annotation. In contrast, the PropBank annotation scheme, designed for predicate-argument structure, specifies the same set of markable entities in each language, while still capturing most of the information desired by the conference guidelines. This allows for considerable savings in time and annotator effort, largely due to the large existing dictionary of predicted argument structure.

1. Introduction

The Inter-Lingual (IL) annotation task, as we understand it, can be decomposed into four subtasks. The first subtask is to identify the annotation entities in the text and to draw a line between entities that need to be annotated and those that do not. Our experience in annotating both versions of the English text and its Chinese equivalent revealed some non-trivial issues. The first issue is the scope of the annotation.
In any given text, in addition to the entities that are explicit in the text (those that have an explicit lexical anchor), there are also implicit entities that can only be inferred. Even for explicit entities, the question can be asked whether all of them should be annotated. For consistent annotation, the boundaries of the annotation must be clearly specified.

The second issue is how the annotation entities should be identified. It seems an obvious requirement that they be identified in a way that facilitates the mapping from annotation entities in one language to those of another, or to some abstract entities that are useful for practical applications. This means that the annotation entities have to be computationally tractable.

It might be useful to briefly compare the way annotation entities are identified in the Proposition Bank (PropBank) (Kingsbury and Palmer 2002, Xue and Palmer 2003) with the proposed IL-annotation. In the PropBank, predicates are identified by comparing them to entries in a precompiled list of predicates along with their expected arguments. The arguments are then specified relative to the predicates, each receiving an argument label that is unique with respect to its predicate. The inter-linguistic mapping can be achieved through the mapping of the entries in the lexicon, which has a finite number of entries. In the proposed IL-annotation, the annotation entities are identified in a much larger name space. It seems to us that some mechanism must be put in place to ensure straightforward inter-linguistic mapping.

The second subtask is the classification of the annotation entities. In our opinion, the classification of the annotation entities must be based on considerations of usefulness, feasibility and consistency, the latter two being inter-related. Usefulness should be evaluated in the context of practical applications. A feasible classification is one in which the necessary distinctions can be made consistently on the basis of reliable criteria.
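
The lexicon-based identification described above for the PropBank can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy lexicons, the frame identifiers, the role descriptions, the `EN_TO_ZH` mapping table, and the function names are hypothetical and are not taken from the actual PropBank frames files. The sketch also assumes the input has already been tokenized and lemmatized.

```python
# Hypothetical, hand-written lexicon entries (not the real PropBank frames).
# Each entry lists a predicate together with its expected arguments.
ENGLISH_LEXICON = {
    "give": {"id": "give.01",
             "roles": {"Arg0": "giver", "Arg1": "thing given", "Arg2": "recipient"}},
    "buy": {"id": "buy.01",
            "roles": {"Arg0": "buyer", "Arg1": "thing bought"}},
}

CHINESE_LEXICON = {
    "给": {"id": "给.01",
           "roles": {"Arg0": "giver", "Arg1": "thing given", "Arg2": "recipient"}},
}

# Hypothetical entry-to-entry mapping between the two lexicons. Because both
# lexicons are finite, this table is finite as well.
EN_TO_ZH = {"give.01": "给.01"}


def identify_predicates(lemmas, lexicon):
    """Return (position, frame id) for every lemma listed in the lexicon."""
    return [(i, lexicon[lemma]["id"])
            for i, lemma in enumerate(lemmas)
            if lemma in lexicon]


def map_frame(frame_id, mapping=EN_TO_ZH):
    """Return the frame id in the other language, or None if unmapped."""
    return mapping.get(frame_id)


found = identify_predicates(["they", "give", "him", "books"], ENGLISH_LEXICON)
for position, frame_id in found:
    print(position, frame_id, map_frame(frame_id))
```

The point of the sketch is only that identification reduces to a lookup in a closed list, and that cross-lingual mapping reduces to a table over lexicon entries. A name space that is not anchored to such a list offers no comparable table to map through.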
From our experience, the distinction between argument-taking entities such as EVENTS and non-argument-taking entities such as OBJECTS is the