Online Japanese Unknown Morpheme Detection using Orthographic Variation

Yugo Murawaki, Sadao Kurohashi
Graduate School of Informatics, Kyoto University
Yoshida-honmachi, Sakyo-ku, Kyoto, 606-8501, Japan
murawaki@nlp.kuee.kyoto-u.ac.jp, kuro@i.kyoto-u.ac.jp

Abstract

To solve the unknown morpheme problem in Japanese morphological analysis, we previously proposed a novel framework of online unknown morpheme acquisition and its implementation. This framework poses a previously unexplored problem, online unknown morpheme detection. Online unknown morpheme detection is the task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because Japanese does not delimit words by white space. We first present a baseline method that simply uses the output of the morphological analyzer. We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes. To cope with this problem, we present a simple solution, the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

1. Introduction

Dictionaries are indispensable resources in natural language processing. This is especially true for Japanese morphological analysis because it requires segmentation as well as part-of-speech (POS) tagging. Japanese, like Chinese and Thai, does not delimit words by white space, and due to boundary ambiguities, the joint task of segmentation and POS tagging has a much larger search space than simple POS tagging. In order to limit the search space, morpheme candidates are enumerated by looking them up in a pre-defined dictionary.
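As an illustration of how dictionary lookup constrains the search space, the following minimal sketch enumerates morpheme candidates over an unsegmented sentence and counts the analyses that the lexicon alone permits. It is not the analyzer used in this paper: the toy lexicon, the POS labels, and the function names enumerate_candidates and count_paths are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Toy lexicon (hypothetical entries, not the dictionary used in the paper):
# surface form -> set of POS tags.
LEXICON: Dict[str, Set[str]] = {
    "すもも": {"noun"},
    "もも": {"noun"},
    "も": {"particle"},
    "の": {"particle"},
    "うち": {"noun"},
}

Candidate = Tuple[int, int, str, str]  # (start, end, surface, pos)


def enumerate_candidates(sentence: str,
                         lexicon: Dict[str, Set[str]],
                         max_len: int = 8) -> List[Candidate]:
    """List every lexicon entry that matches a substring of the sentence.

    Candidates are returned in order of increasing start position.
    """
    candidates: List[Candidate] = []
    for start in range(len(sentence)):
        limit = min(len(sentence), start + max_len)
        for end in range(start + 1, limit + 1):
            surface = sentence[start:end]
            for pos in sorted(lexicon.get(surface, ())):
                candidates.append((start, end, surface, pos))
    return candidates


def count_paths(sentence: str, lexicon: Dict[str, Set[str]]) -> int:
    """Count analyses (segmentation plus tagging) that cover the whole
    sentence using only lexicon entries."""
    ways = [0] * (len(sentence) + 1)
    ways[0] = 1  # one way to analyze the empty prefix
    # Candidates are ordered by start, so ways[start] is final when read.
    for start, end, _surface, _pos in enumerate_candidates(sentence, lexicon):
        ways[end] += ways[start]
    return ways[len(sentence)]


if __name__ == "__main__":
    sentence = "すもも" + "も" + "もも" + "も" + "もも" + "の" + "うち"
    print(len(enumerate_candidates(sentence, LEXICON)), "candidates")
    print(count_paths(sentence, LEXICON), "analyses")
    print(count_paths("ググる", LEXICON), "analyses for an unlisted verb")
```

With this toy lexicon the twelve-character sentence admits 13 analyses, which shows the boundary ambiguity that remains even after lookup. A string made only of unlisted characters, such as "ググる", yields no candidates at all, so no path covers it under this sketch.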

Historically, extensive human resources were used to build high-coverage dictionaries (Yokoi, 1995). They now cover almost everything except rare proper nouns in newspaper articles. Research has therefore concentrated on finding an optimal path when a high-coverage dictionary is available, and an F-score of nearly 99% has been achieved (Kurohashi et al., 1994; Asahara and Matsumoto, 2000; Kudo et al., 2004).

Manually-constructed dictionaries do not, however, suffice for texts other than newspaper articles, web pages in particular, where morphological analysis is prone to more errors owing to unknown morphemes, that is, morphemes not in the dictionary. For example, the unknown verb “ググる” (gugu-ru, “to google”) is erroneously segmented into “ググ” (gugu) and “る” (ru).

One solution to the problem is to automatically augment the dictionary by acquiring unknown morphemes from text (Mori and Nagao, 1996). We previously proposed a novel framework of online acquisition of unknown morphemes (Murawaki and Kurohashi, 2008). Unlike traditional batch extraction (Mori and Nagao, 1996), the proposed method can acquire unknown morphemes in an online mode. The lexicon acquirer processes text on a sentence-by-sentence basis. It directly updates the dictionary of the analyzer when it successfully disambiguates an unknown morpheme.

The framework of online acquisition poses a previously unexplored problem, online unknown morpheme detection. It is the task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because, again, Japanese does not delimit words by white space. We have to compare the whole sentence, not individual words, with the morphemes registered in the dictionary.

In this paper, we first present a baseline method of detection that simply uses the output of the morphological analyzer. The baseline method can easily detect most unknown morphemes because they cannot be interpreted as registered morphemes.
We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes due to the overly simple sound structure of Japanese. For example, the unknown adjective “うざい” (uza-i, “annoying”) is over-segmented into a combination of registered morphemes, “う” (u) and “ざい” (zai). To cope with this problem, we propose the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

2. Related Work

2.1. Morpheme Extraction from Text

Various methods have been proposed to extract morphemes from text. For languages delimited by white space, they can be extracted from a word[1] list (Kurimo et al., 2006; Poon et al., 2009).

[1] Throughout this paper, we distinguish words from morphemes. Each word consists of one or more morphemes.

For languages where even word boundaries are unmarked, two major approaches are used. One is to segment the whole corpus and to build the dictionary, or the list of morphemes, from the segmented corpus. Segmentation models can be learnt from a manually-segmented training cor-