UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese
Toshinobu Ogiso*†, Mamoru Komachi†, Yasuharu Den‡, Yuji Matsumoto†
*Department of Corpus Studies, National Institute for Japanese Language and Linguistics (NINJAL)
†Graduate School of Information Science, Nara Institute of Science and Technology (NAIST)
‡Faculty of Letters, Chiba University
10-2, Midori-cho, Tachikawa-shi, Tokyo JAPAN 190-8561
E-mail: togiso@ninjal.ac.jp, komachi@is.naist.jp, den@cogsci.l.chiba-u.ac.jp, matsu@is.naist.jp
Abstract
In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the levels of lexicon, morphology, grammar, orthography and pronunciation. To overcome these problems, we extended the dictionary entries and created a training corpus of Early Middle Japanese, adapting UniDic for Contemporary Japanese to Early Middle Japanese. Experimental results show that the proposed UniDic-EMJ, a new dictionary for Early Middle Japanese, achieves accuracy (97%) high enough for linguistic research on lexicon and grammar in the analysis of classical Japanese texts.
Keywords: Morphological Analysis, Classical Japanese, Early Middle Japanese, Historical Corpus of Japanese
1. Background
Recently, corpus linguistics has become popular among Japanese linguists. To facilitate further research in corpus linguistics, the National Institute for Japanese Language and Linguistics (NINJAL) has compiled one of the largest Japanese corpora, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) (Maekawa et al., 2010). Following the same line of research, a diachronic corpus of Japanese is currently under construction.
Since corpus linguistics relies heavily on word-segmented corpora, it is important to have morphological annotations for the corpus under study. However, morphological annotations do not come for free, so Japanese corpus linguists need an automatic morphological analyzer. A highly accurate and effective morphological analyzer requires a carefully constructed wide-coverage dictionary. Such a dictionary is essential for statistical and machine learning-based approaches to succeed. For example, the state-of-the-art Japanese morphological analyzer MeCab (Kudo et al., 2004) is trained with an electronic dictionary called UniDic (http://download.unidic.org/) on the manually annotated BCCWJ. In UniDic, all entries are based on the definition of the short unit word (SUW), which yields word segmentation into units of uniform size suited to linguistic research. UniDic also achieves high performance in many text genres, including literature and spoken texts (Den et al., 2007).
However, the original UniDic covers only Contemporary Japanese (CJ). We conducted preliminary experiments on the morphological analysis of literature written in Early Middle Japanese (EMJ) using the state-of-the-art morphological analyzer MeCab with contemporary dictionaries. Its accuracy on EMJ turned out to be considerably lower than the reported accuracy for newswire texts, and completely inadequate for Japanese linguists. One of the reasons is that there was a massive change in writing style in the Meiji era (1868-1912).
Early Middle Japanese is a historical stage of the Japanese language used in the Heian period (A.D. 794-1185). In the Heian period, various styles of Japanese literature such as monogatari (tales) and nikki bungaku (diary literature) appeared for the first time in history. Waka (native Japanese poetry) also flourished at this time. Masterpieces such as the Tale of Genji, the Tosa Diary, and the Kokin Waka-shū poetry anthology, to name a few, were written in this era. Therefore, morphological analysis of EMJ is especially useful for Japanese historical linguists.
As the first step toward rich annotation of linguistic information for historical texts in the diachronic corpus, we propose to start by building an electronic dictionary for morphological analysis adapted for EMJ. Morphological analysis is one of the fundamental annotations in the construction of a full-scale corpus.
The rest of this paper is organized as follows. Section 2 describes the characteristics of Early Middle Japanese. Section 3 explains how we built the UniDic for Early Middle Japanese. Section 4 compares the UniDic for Early Middle Japanese with other dictionaries to show its effectiveness. Section 5 presents conclusions and suggests future directions.
2. Linguistic Characteristics of Early Middle Japanese
Early Middle Japanese has various characteristics that distinguish it from CJ at several linguistic levels: lexicon, morphology, syntax, orthography and pronunciation. We