Multiword Expressions: Some Problems for Japanese NLP

Timothy Baldwin
CSLI, Stanford University
210 Panama St., Stanford, CA 94305, USA
tbaldwin@csli.stanford.edu

Francis Bond
NTT Communication Science Labs
2-4 Hikari-dai, Kyoto 619-0237, Japan
bond@cslab.kecl.ntt.co.jp

Abstract

Multiword expressions (MWEs) are notoriously difficult to handle in any language, due to syntactic and semantic idiosyncrasies. In this paper, we focus on Japanese in illustrating the types of difficulties MWEs present for NLP systems, in terms of both analysis and generation. We also outline a number of strategies which can be used to overcome such difficulties.

1 Introduction

The correct treatment of multiword expressions (MWEs) is increasingly being recognized as an important problem, both in linguistics and natural language processing (Sag et al., 2002). In English, Jackendoff (1997: 156) estimates that the number of MWEs in a speaker’s lexicon is of the same order of magnitude as the number of single words. In many on-line lexical resources almost half of the entries are multiword expressions. For example, in WordNet 1.7 (Fellbaum 1999), 41% of the entries are multiword.

The definition of a multiword expression in English is often given as a “word with spaces”. This is not applicable to Japanese, which is typically written without spaces. Instead, we define multiword expressions very roughly as “idiosyncratic interpretations that cross word boundaries”. That is, a multiword expression is one that is made up of several units that would normally be segmented into separate words, but whose meaning is not composed purely of the meanings of the individual words.

We divide MWEs into two classes: lexicalized phrases and institutionalized phrases.
Lexicalized phrases have at least partially idiosyncratic syntax or semantics, or contain ‘words’ which do not occur in isolation; they can be further broken down into fixed expressions, semi-fixed expressions and syntactically-flexible expressions, in roughly decreasing order of lexical rigidity. Institutionalized phrases are syntactically and semantically compositional, but occur with markedly high frequency (in a given context).

In the next section, we give examples of the types of multiword expressions. This is followed by a discussion of some of the current approaches to handling them in Japanese NLP systems.

2 Types of MWEs

Fixed expressions are those that are totally frozen, and appear to act as a single word. An example is mae muki “positive (lit: facing forward)”. The interpretation “positive” is not strictly compositional; in addition, there can be no variation in word order or internal modification. Also, the normal compositional expression mae ni muku “to face forward” contains a postposition, which is absent in the MWE. Another example is chitto mo “(not) at all”. Here, the final postposition is used with its normal meaning, but the first ‘word’ cannot be used alone.

Because Japanese does not explicitly segment words, it is not clear that the above examples are genuinely composed of multiple words. The argument for treating them as MWEs is twofold. First, this analysis captures regularities: chitto mo behaves in many ways like a noun followed by mo. Second, it makes the task of the segmenter simpler: it can separate the input into small segments, and the grammar can build them into MWEs as necessary.

Semi-fixed expressions allow some variation. An example of this is the complex postposition ni tsuite “about (lit: placed in)”, which also has the more formal form ni tsukimashite. This is similar to the English complex preposition with regard to.
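The segment-then-combine strategy described above, together with the form variation of semi-fixed expressions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a system from this paper: the lexicon `MWE_LEXICON`, the function `merge_mwes`, the use of romanized segments, and the greedy longest-match policy, which stands in for whatever the grammar would actually do.

```python
from typing import Dict, List, Tuple

# Hypothetical MWE lexicon (an assumption for illustration). Keys are
# sequences of small segments, romanized for readability; values are glosses.
# Both forms of the semi-fixed postposition map to the same gloss.
MWE_LEXICON: Dict[Tuple[str, ...], str] = {
    ("mae", "muki"): "positive",
    ("chitto", "mo"): "(not) at all",
    ("ni", "tsuite"): "about",
    ("ni", "tsukimashite"): "about",
}

# Longest MWE in the lexicon, used to bound the lookahead window.
MAX_LEN = max(len(key) for key in MWE_LEXICON)


def merge_mwes(segments: List[str]) -> List[Tuple[str, ...]]:
    """Group small segments into MWEs by greedy longest match, left to right.

    Segments that start no known MWE are returned as one-element tuples.
    """
    units: List[Tuple[str, ...]] = []
    i = 0
    while i < len(segments):
        # Try the longest candidate first, down to two segments.
        for n in range(min(MAX_LEN, len(segments) - i), 1, -1):
            candidate = tuple(segments[i:i + n])
            if candidate in MWE_LEXICON:
                units.append(candidate)
                i += n
                break
        else:
            # No MWE starts here: keep the segment as a single word.
            units.append((segments[i],))
            i += 1
    return units
```

For a hypothetical segmenter output such as `["kare", "ni", "tsuite", "hanasu"]`, `merge_mwes` groups `ni` and `tsuite` into one unit and leaves the other segments alone, while the compositional sequence `["mae", "ni", "muku"]` is left unmerged. A pure lexicon lookup like this cannot tell an MWE reading from a compositional reading of the same segment sequence, which is one reason the text leaves the combination step to the grammar.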
Although the individual parts can be used in other contexts, they have a fixed meaning in combination and allow no

In Proceedings of the Eighth Annual Meeting of the Association of Natural Language Processing (Japan), Keihanna, Japan, pp. 379-82.