Criteria for Identifying and Annotating Caused Motion Constructions in Corpus Data

Jena D. Hwang¹, Annie Zaenen², Martha Palmer¹
¹ Department of Linguistics, University of Colorado at Boulder
² Center for the Study of Language and Information, Stanford University
hwangd@colorado.edu, azaenen@stanford.edu, martha.palmer@colorado.edu

Abstract
While natural language processing performance has been improved through the recognition that there is a relationship between the semantics of the verb and the syntactic context in which the verb is realized, sentences where the verb does not conform to the expected syntax-semantic patterning behavior remain problematic. For example, in the sentence "The crowd laughed the clown off the stage", a verb of non-verbal communication, laugh, is used in a caused motion construction and gains a motion entailment that is atypical given its inherent lexical semantics. This paper focuses on our efforts at defining the semantic types and varieties of caused motion constructions (CMCs) through an iterative annotation process and establishing annotation guidelines based on these criteria to aid in the production of consistent and reliable annotation. The annotation will serve as training and test data for classifiers for CMCs, and the CMC definitions developed throughout this study will be used in extending VerbNet to handle representations of sentences in which a verb is used in a syntactic context that is atypical for its lexical semantics.

Keywords: VerbNet, caused motion constructions, semantic coercion, lexical semantics, sentence representation

1. Introduction
While natural language processing performance has been improved through the recognition that there is a relationship between the semantics of the verb and the syntactic context in which the verb is realized (Gildea and Palmer, 2002), sentences where the verb does not conform to the expected syntax-semantic patterning behavior remain problematic.
Consider the following sentences:
1. The goalie kicked the ball into the field.
2. The crowd laughed the clown off the stage.
3. The market tilted the economy into recession.
These sentences are semantically related: an entity causes a second entity to go along the path described by the prepositional phrase. In 1, the goalie causes the ball to go into the field; in 2, the crowd causes the clown to go off the stage; and in 3, the market causes the economy to go into recession. While only the verb in the first sentence is generally identified as a verb of motion that can appear in a caused motion context, all three are examples of caused motion constructions (CMCs) (Goldberg, 1995). The verb laugh of sentence 2 is normally considered an intransitive manner of speaking verb (e.g. The crowd laughed at the clown), but in this sentence the verb is coerced into the caused motion interpretation, and the semantics of the verb gives the manner in which the movement happened (i.e. the crowd caused the clown to move off the stage by means of laughing). The verb tilt is a verb of spatial configuration normally taking, as its object argument, the inclined item (e.g. He tilted the bottle). In 3, the verb is not only coerced into the caused motion reading; the coerced meaning is also abstract rather than physical (compare the physical caused motion use in He tilted the liquid into his mouth and swallowed). Whether the motion is physical or abstract, the semantics parallel one another: all three sentences have a causal argument responsible for the event, an argument in motion, and a path that specifies the initial, middle, or final location, state or condition of the argument in motion.
Thus, if the semantic interpretation is strictly based on the expected semantics of the verb and its arguments, it fails to include the relevant information from the CMC. An accurate semantic role labelling for such sentences requires NLP classifiers to accurately identify these coerced usages in data.
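The structure shared by sentences 1 to 3 (a causal argument, an argument in motion, and a path) can be sketched as a small record type. This is only an illustrative sketch: the class name `CausedMotion`, its field names, and the `physical` flag are assumptions made for this example, and they are neither VerbNet role labels nor the annotation scheme developed in this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausedMotion:
    """Hypothetical record for one caused motion reading (names are illustrative)."""
    cause: str      # causal argument responsible for the event
    theme: str      # argument in motion
    path: str       # prepositional phrase describing the path
    verb: str       # surface verb; when coerced, it supplies the manner
    physical: bool  # False when the motion is abstract rather than physical

    def paraphrase(self) -> str:
        # Mirrors the paraphrase pattern used in the text:
        # "<cause> causes <theme> to go <path>"
        return f"{self.cause} causes {self.theme} to go {self.path}"


# The three example sentences, encoded by hand.
EXAMPLES = [
    CausedMotion("the goalie", "the ball", "into the field", "kick", True),
    CausedMotion("the crowd", "the clown", "off the stage", "laugh", True),
    CausedMotion("the market", "the economy", "into recession", "tilt", False),
]

for example in EXAMPLES:
    print(example.paraphrase())
```

An interpretation built only from the typical semantics of laugh or tilt would have no slot for the theme or the path, which is the information the construction contributes.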
Furthermore, once the CMCs are identified and the semantic roles are properly assigned, the sentence would require an accurate semantic interpretation with appropriate representations that include the semantics of the CMCs. Making a semantic analysis available for both conventional and coerced caused motion instances would be useful in making inferences related to the states or locations pre- and post-event (Zaenen et al., 2008).
In a pilot study, we determined that CMCs can be automatically identified with high accuracy (Hwang et al., 2010). The pilot study was conducted in a highly controlled environment over a small portion of Wall Street Journal data. This current effort is aimed at providing a larger set of high-quality annotated data for further training and testing of CMC classifiers. In this study, we develop detailed criteria for identifying CMCs that will aid in the production of consistent annotation with high inter-annotator agreement. In turn, successful annotation of the data will be used to establish whether or not the descriptive criteria are indeed useful in characterizing CMCs.
For semantic representation, we turn to the lexical resource VerbNet and the semantic predicates it provides for sentence representation. VerbNet groups verbs according to their typical semantic and syntactic behaviors and is built to best handle instances where the verb is used in its typical syntactic context like the one seen in example 1. Verb-