Adding Semantic Annotation to the Penn TreeBank

Paul Kingsbury
University of Pennsylvania
Department of Linguistics
619 Williams Hall
Philadelphia PA 19104
1-267-738-8262
kingsbur@unagi.cis.upenn.edu

Martha Palmer
University of Pennsylvania
Department of Computer Science
256 GRW
Philadelphia PA 19104
1-215-898-9513
mpalmer@linc.cis.upenn.edu

Mitch Marcus
University of Pennsylvania
Department of Computer Science
461A GRW
Philadelphia PA 19104
1-215-898-2538
mitch@linc.cis.upenn.edu

ABSTRACT
This paper presents our basic approach to creating Proposition Bank, which involves adding a layer of semantic annotation to the Penn English TreeBank. Without attempting to confirm or disconfirm any particular semantic theory, our goal is to provide consistent argument labeling that will facilitate the automatic extraction of relational data. An argument such as the window in John broke the window and in The window broke would receive the same label in both sentences. In order to ensure reliable human annotation, we provide our annotators with explicit guidelines for labeling all of the syntactic and semantic frames of each particular verb. We give several examples of these guidelines and discuss the inter-annotator agreement figures. We also discuss our current experiments on the automatic expansion of our verb guidelines based on verb class membership. Our current rate of progress and our consistency of annotation demonstrate the feasibility of the task.

Keywords
Predicate argument structure, semantic annotation, verb classes.

1. INTRODUCTION
Recent years have seen major breakthroughs in natural language processing technology based on the development of powerful new techniques that combine statistical methods and linguistic representations [1,2,3,11]. A critical element that is still lacking, however, is detailed predicate-argument structure.
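To make the notion of predicate-argument structure concrete, the following minimal sketch shows what consistent argument labeling could look like for the two sentences mentioned in the abstract, John broke the window and The window broke. It is an illustration only, not the annotation format used in the project: the class `Proposition`, the helper functions, and the label names `instigator` and `thing_broken` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposition:
    """A verb plus its labeled arguments (label names are illustrative only)."""
    predicate: str
    arguments: tuple  # sorted tuple of (label, text) pairs


def make_proposition(predicate, **labeled_args):
    """Build a proposition from keyword arguments of the form label=text."""
    return Proposition(predicate, tuple(sorted(labeled_args.items())))


def label_of(prop, text):
    """Return the label assigned to the argument `text`, or None if absent."""
    for label, arg in prop.arguments:
        if arg == text:
            return label
    return None


# "John broke the window": transitive frame.
transitive = make_proposition(
    "break", instigator="John", thing_broken="the window"
)

# "The window broke": intransitive frame. The syntactic subject is still
# the thing broken, so it keeps the same (hypothetical) label.
intransitive = make_proposition("break", thing_broken="the window")

# The window receives the same label in both sentences.
same_label = (
    label_of(transitive, "the window") == label_of(intransitive, "the window")
)
```

The point of the sketch is that the label follows the participant's role in the event rather than its syntactic position, which is what allows relational data to be extracted uniformly from both sentences.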
In the same way that the existence of the Penn TreeBank [8,9] enabled the development of extremely powerful new syntactic analyzers, moving to the stage of accurate predicate argument analysis will require a body of publicly available training data that explicitly annotates predicate argument positions with labels. A consensus on a task-oriented level of semantic representation has been achieved with respect to English, under the auspices of the ACE program (involving research groups at BBN, MITRE, New York University, and Penn). It was agreed that the highest priority, and the most feasible type of semantic annotation, is co-reference and predicate argument structure for verbs, participial modifiers and nominalizations, to be known as Proposition Bank, or PropBank. This paper describes the PropBank verb predicate argument structure annotation being done at Penn. Similar projects include FrameNet [7] and Prague Tectogrammatics [4].

2. PREDICATE ARGUMENT STRUCTURE
The verb of the sentence typically indicates a particular event, and the verb’s syntactic arguments are associated with the event participants. In the sentence John broke the window, the event is a breaking event, with John as the instigator and a broken window as the result. The associated predicate argument structure would be break(John, window).

Recognition of predicate argument structures is not straightforward, since a natural language will have both several different lexical items that can be used to refer to the same type of event and several different syntactic realizations of the same predicate argument relations. For example, a meeting between two dignitaries can be described using the verbs meet, visit, debate, consult, etc.,¹ which are syntactically interchangeable while each lends its own individual semantic nuances. Thus, variations such as the following are seen:

    John will [meet/visit/debate/consult] (with) Mary.
    John and Mary [met/visited/debated/consulted].
    There was a [meeting/visit/debate/consultation] between John and Mary.
    John had a [meeting/visit/debate/consultation] with Mary.

At the same time, not all syntactic frames of a given verb are interchangeable with those of related verbs:

    Blair [met/consulted/visited] with Bush.
    The proposal [met/*consulted/*visited] with skepticism.

¹ These are representative of the meet class (36) of [6].

In determining consistent annotations for argument labels of several different syntactic expressions of the same verb, we are relying heavily on recent work in linguistics on word classifications that have a more semantic orientation, such as Levin’s verb classes [6] and WordNet [10]. The verb classes are based on the ability of the verb to occur or not occur in pairs of syntactic frames that are in some sense meaning-preserving (diathesis alternations) [6]. The distribution of syntactic frames in which a verb can appear determines its class membership, to a finer degree than mere semantic similarity can provide. The fundamental assumption is that the syntactic frames are a direct reflection of the underlying semantics; the sets of syntactic frames associated with a particular verb of a particular Levin class reflect underlying semantic components that constrain allowable arguments. These classes, and our refinements on