MORPHEME SEGMENTATION BY OPTIMIZING TWO-PART MDL CODES

Krista Lagus, Mathias Creutz, Sami Virpioja and Oskar Kohonen
Adaptive Informatics Research Centre, Helsinki University of Technology,
P.O.Box 5400, FIN-02015 TKK, FINLAND
krista.lagus@tkk.fi

1. INTRODUCTION

In many real-world NLP applications, a compact yet representative vocabulary is a necessary ingredient. Words are often thought of as the basic units of representation. In highly inflecting and compounding languages, however, words can consist of long sequences of meaningful segments, such as prefixes, stems and suffixes: kahvi + n + juo + ja + lle + kin ('also for the coffee drinker'). Overlooking the regularities caused by these common elements accentuates data sparsity, which is a serious problem for the accurate estimation of statistical language models.

In statistical language modeling the task is to estimate probabilities of word sequences. The state-of-the-art approach in applications such as speech recognition is to model word sequences as Markov chains, i.e., using the n-gram model. However, while the n-gram model obtains reasonable performance in English, with languages like Finnish it runs into serious problems caused by data sparsity. The reason can be understood by looking at how the vocabulary size increases with corpus size in different languages, as shown in Fig. 1. If complete word forms are taken as the basic linguistic units, the vocabularies needed for NLP applications become very large, especially for highly inflecting and compounding languages. Finding a better segmentation of the linguistic data is therefore useful.

From a linguistic point of view, Finnish frequently employs inflection (e.g. 'sorme+t', 'finger+s') and compounding (e.g. 'vasen+kätinen', 'left+handed'). It is thus very productive on the word level: any number of new words can easily be produced in this manner by a competent language user.
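The effect of segmentation on vocabulary size can be illustrated with a toy sketch. Here a hypothetical, hand-written segmentation table (the words and splits are our own illustrative choices, not the paper's output) stands in for an automatic splitter:

```python
from collections import Counter

# Toy illustration: nine inflected and compounded Finnish word forms,
# each mapped to a hand-picked segmentation for this example only.
SEGMENTS = {
    "kahvi": ["kahvi"],
    "kahvin": ["kahvi", "n"],
    "kahville": ["kahvi", "lle"],
    "juoja": ["juo", "ja"],
    "juojalle": ["juo", "ja", "lle"],
    "juojallekin": ["juo", "ja", "lle", "kin"],
    "kahvinjuoja": ["kahvi", "n", "juo", "ja"],
    "kahvinjuojalle": ["kahvi", "n", "juo", "ja", "lle"],
    "kahvinjuojallekin": ["kahvi", "n", "juo", "ja", "lle", "kin"],
}

corpus = list(SEGMENTS)  # nine distinct word forms
word_vocab = Counter(corpus)
segment_vocab = Counter(s for w in corpus for s in SEGMENTS[w])

# Nine whole-word types collapse to six segment types (kahvi, n, lle,
# juo, ja, kin), and each segment recurs, easing unigram estimation.
print(len(word_vocab), len(segment_vocab))  # 9 6
```

The gap widens as more inflected variants of the same stems are added: the word vocabulary grows with every new form, while the segment vocabulary stays nearly fixed.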
Since many word forms in a sentence are thus rare, obtaining reasonable probability estimates for longer word sequences becomes very hard. In contrast, if these long compound and inflected word forms can be split automatically into reasonable segments, then even if a complete compound has not been seen before, each segment may be familiar, and can therefore obtain at least a unigram probability estimate that is more accurate than the estimate reserved for predicting out-of-vocabulary items.

(The financial support from the Academy of Finland is gratefully acknowledged.)

[Figure 1. The number of unique words (in thousands) as a function of corpus size (in thousands of words) for Finnish, Estonian, Turkish, Arabic and English. 'Planned' refers to written news text, whereas 'spontaneous' consists of transcripts of phone conversations.]

There exist linguistic methods and automatic tools for retrieving morphological analyses of words, e.g., based on the two-level morphology formalism [1]. However, these systems require extensive tailoring by linguistic experts for each new language. Moreover, when new words emerge, their morphological analyses must be added to the system manually.

Inspired by the coding philosophy of the Minimum Description Length (MDL) principle of Rissanen [2], we decided to apply MDL to the problem of discovering a segmentation of words into their smaller representative parts. Our hope was that instead of finding, say, syllables, this would lead us to meaningful parts, that is, linguistic morphemes. Moreover, there were interesting similarities between the codes found by MDL and properties of natural languages. Natural language can, of course, be viewed as a code for communicating ideas.
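A two-part code of the kind named in the title can be sketched as follows. This is a minimal reading of the idea, not the authors' exact cost function: part one spells out the lexicon of distinct morphs, part two codes the corpus as a sequence of pointers to lexicon entries. The alphabet size and the toy corpus are our own assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET_SIZE = 28  # hypothetical: letters plus boundary markers

def two_part_cost(segmented_corpus):
    """Description length (bits) of a corpus under one segmentation."""
    tokens = [m for word in segmented_corpus for m in word]
    counts = Counter(tokens)
    total = sum(counts.values())

    # Part 1: lexicon cost -- spell out each distinct morph letter by
    # letter (plus an end marker), log2(alphabet size) bits per symbol.
    lexicon_cost = sum((len(m) + 1) * math.log2(ALPHABET_SIZE)
                       for m in counts)

    # Part 2: corpus cost -- code each morph token with its maximum-
    # likelihood unigram probability, i.e. -log2 p(m) bits per token.
    corpus_cost = -sum(c * math.log2(c / total) for c in counts.values())
    return lexicon_cost + corpus_cost

# Splitting shared stems shortens the total code: the lexicon stores
# "kahvi" once instead of spelling it inside three separate entries.
unsplit = [["kahvinjuojalle"], ["kahvinjuoja"], ["kahvin"]]
split = [["kahvi", "n", "juo", "ja", "lle"],
         ["kahvi", "n", "juo", "ja"],
         ["kahvi", "n"]]
print(two_part_cost(unsplit) > two_part_cost(split))  # True
```

Minimizing such a cost balances the two parts: splitting everything into single letters makes the lexicon tiny but the corpus code long, while keeping whole words does the opposite; the optimum lies at recurring, morpheme-like segments.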
Many natural languages seem to exhibit the property that frequent words tend to be shorter, while rare words can be arbitrarily long.

Another source of inspiration was an early unsupervised morpheme segmentation method called Linguistica [3], which, while reasonably good for English, made as-