Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 668–677, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP

Unsupervised morphological segmentation and clustering with document boundaries

Taesun Moon, Katrin Erk, and Jason Baldridge
Department of Linguistics
University of Texas at Austin
1 University Station B5100
Austin, TX 78712-0198 USA
{tsmoon,katrin.erk,jbaldrid}@mail.utexas.edu

Abstract

Many approaches to unsupervised morphology acquisition incorporate the frequency of character sequences with respect to each other to identify word stems and affixes. This typically involves heuristic search procedures and the calibration of multiple arbitrary thresholds. We present a simple approach that uses no thresholds other than those involved in standard application of χ² significance testing. A key part of our approach is using document boundaries to constrain the generation of candidate stems and affixes and the clustering of morphological variants of a given word stem. We evaluate our model on English and the Mayan language Uspanteko; it compares favorably to two benchmark systems which use considerably more complex strategies and rely more heavily on experimentally chosen threshold values.

1 Introduction

Unsupervised morphology acquisition attempts to learn from raw corpora one or more of the following about the written morphology of a language: (1) the segmentation of the set of word types in a corpus (Creutz and Lagus, 2007), (2) the clustering of word types in a corpus based on some notion of morphological relatedness (Schone and Jurafsky, 2000), and (3) the generation of out-of-vocabulary items which are morphologically related to other word types in the corpus (Yarowsky et al., 2001).

We take a novel approach to segmenting words and clustering morphologically related words. The approach uses no parameters that need to be tuned on data.
The two main ideas of the approach are (a) the filtering of affixes by significant co-occurrence, and (b) the integration of knowledge of document boundaries both when generating candidate stems and affixes and when clustering morphologically related words.

The main application that we envision for our approach is producing interlinearized glossed texts for under-resourced/endangered languages (Palmer et al., 2009). Thus, we strive to eliminate hand-tuned parameters so that documentary linguists can use our model as a preprocessing step for their manual analysis of stems and affixes. Requiring a documentary linguist, who is likely to have little to no knowledge of NLP methods, to tune parameters is infeasible. Additionally, data-driven exploration of parameter settings is unlikely to be reliable in language documentation, since datasets are typically quite small. To be relevant in this context, a model needs to produce useful results out of the box.

Constraining learning with document boundaries has been used quite effectively in unsupervised word sense disambiguation (Yarowsky, 1995), and many applications in information retrieval are built on the statistical correlation between documents and terms. However, we are unaware of cases where knowledge of document boundaries has been used in unsupervised learning of morphology. The intuition behind our approach is very simple: if two words in a single document are very similar in terms of orthography, then the two words are likely to be related morphologically. We measure how integrating these assumptions into our model at different stages affects performance.

We define a simple pipeline model. After generating candidate stems and affixes (possibly constrained by document boundaries), a χ² test based on global corpus counts filters out unlikely affixes. Mutually consistent affix pairs are then clustered to form affix groups.
These in turn are used to build morphologically related word clusters, possibly constrained by evidence from co-occurrence of word forms in documents. Following Schone and Jurafsky (2000), clusters are evaluated for
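The first two pipeline steps, generating candidate stem/suffix splits within documents and filtering suffixes with a χ² test over global corpus counts, can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the split heuristic, the `max_suffix_len` bound, the 2×2 contingency-table layout, and all function names are our own assumptions; the only threshold is the standard χ² critical value for p < 0.05 at one degree of freedom.

```python
from collections import Counter

def candidate_splits(word, max_suffix_len=4):
    """Yield candidate (stem, suffix) splits of a word, including the null suffix.

    The bound on suffix length is an illustrative assumption, not from the paper.
    """
    yield word, ""
    for i in range(1, min(max_suffix_len, len(word) - 1) + 1):
        yield word[:-i], word[-i:]

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0

def significant_suffixes(documents, critical_value=3.841):
    """Keep suffixes whose attachment to some stem is significant at p < 0.05 (1 df).

    `documents` is a list of documents, each a list of word tokens; counting
    word *types* per document is where document boundaries enter the model.
    """
    pair_counts, stem_counts, suffix_counts = Counter(), Counter(), Counter()
    for doc in documents:
        for word in set(doc):  # one count per word type per document
            for stem, suffix in candidate_splits(word):
                pair_counts[stem, suffix] += 1
                stem_counts[stem] += 1
                suffix_counts[suffix] += 1
    total = sum(pair_counts.values())
    kept = set()
    for (stem, suffix), a in pair_counts.items():
        if not suffix:
            continue
        b = stem_counts[stem] - a      # this stem with other suffixes
        c = suffix_counts[suffix] - a  # this suffix with other stems
        d = total - a - b - c          # everything else
        if chi_square_2x2(a, b, c, d) > critical_value:
            kept.add(suffix)
    return kept
```

On real corpora the global counts separate productive suffixes from accidental string overlaps; on toy input the filter is necessarily noisy. For example, `significant_suffixes([["walking", "walked", "talking", "talked"]])` retains "ing" and "ed", since each attaches to both of the stems "walk" and "talk" more often than chance predicts.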