Unsupervised Segmentation of Bilingual Text

Slav Petrov
University of California at Berkeley
slav@petrovi.de

Fall 2004

Abstract

This paper presents a system for segmenting bilingual text. A Naive Bayes classifier using n-gram counts is trained on an unlabeled corpus using the Expectation Maximization (EM) algorithm. We show that the performance can be improved when a small set of labeled documents is available: the classifier is first trained on the available labeled documents, and then the EM technique of iterating (until convergence) between probabilistically labeling the unlabeled documents and re-estimating the model's parameters is applied. While the accuracy of this semi-supervised classifier is higher than that of the unsupervised classifier, the accuracy degrades with each iteration of EM. In the second part, Deterministic Annealing (DA) is applied instead of EM and is shown to converge to a better local maximum, thus achieving better segmentation results. Again, however, the use of unlabeled data only decreases the quality of the classifier.

1. Introduction

A system that is able to segment text containing two languages could help linguists better understand how bilingual speakers use the different languages. Children growing up in a bilingual environment, in particular, tend to mix languages heavily and even create new composite words out of words from different languages. Analyzing the way that words from different languages are combined could help us understand the way that language is acquired and represented by children. When dictionaries are available, the task simplifies to finding the word stem, querying both dictionaries, and resolving ambiguities when a word can be part of both languages. Here we will focus on the situation where no dictionaries are available. We will train a classifier on an unlabeled training set of bilingual text and then evaluate the performance of predicting the language of words from this training set.
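The semi-supervised EM loop sketched in the abstract can be illustrated in a few lines. This is a minimal sketch under assumptions not taken from the paper: a character-bigram Naive Bayes model, add-one smoothing, and invented toy words standing in for the two languages.

```python
# Minimal sketch of semi-supervised EM for a Naive Bayes word classifier.
# Assumptions (not from the paper): character-bigram features, add-one
# smoothing, hypothetical toy data.
import math
from collections import Counter

def bigrams(word):
    w = "^" + word + "$"  # pad with boundary symbols
    return [w[i:i + 2] for i in range(len(w) - 1)]

def m_step(soft_counts, vocab):
    """Re-estimate smoothed log-probabilities from (soft) bigram counts."""
    models = {}
    for c, counts in soft_counts.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        models[c] = {g: math.log((counts.get(g, 0.0) + 1.0) / total)
                     for g in vocab}
    return models

def e_step(models, priors, word):
    """Posterior over classes for one word: the probabilistic labeling."""
    logp = {c: math.log(priors[c]) + sum(models[c][g] for g in bigrams(word))
            for c in models}
    m = max(logp.values())
    unnorm = {c: math.exp(v - m) for c, v in logp.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def em(labeled, unlabeled, classes, iters=10):
    """Initialize from the labeled words, then alternate E and M steps."""
    vocab = ({g for w, _ in labeled for g in bigrams(w)}
             | {g for w in unlabeled for g in bigrams(w)})
    hard = {c: Counter() for c in classes}
    for w, c in labeled:
        hard[c].update(bigrams(w))
    priors = {c: 1.0 / len(classes) for c in classes}
    models = m_step(hard, vocab)
    for _ in range(iters):
        # E-step: probabilistically label each unlabeled word
        soft = {c: Counter(hard[c]) for c in classes}
        mass = {c: sum(1.0 for _, cc in labeled if cc == c) for c in classes}
        for w in unlabeled:
            post = e_step(models, priors, w)
            for c, p in post.items():
                mass[c] += p
                for g in bigrams(w):
                    soft[c][g] += p
        # M-step: re-estimate priors and bigram distributions
        total = sum(mass.values())
        priors = {c: mass[c] / total for c in classes}
        models = m_step(soft, vocab)
    return models, priors
```

Note that the labeled counts are held fixed across iterations while only the unlabeled words are relabeled; this is the iteration whose effect on accuracy is examined later in the paper.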
Unsupervised classification, or clustering, assumes that the training data is not labeled with class information. The goal is to create structure for the data by objectively partitioning it into homogeneous groups such that within-group similarity and between-group dissimilarity are optimized.

A lot of work has been done on language segmentation and classification. Here are a few examples of text segmentation tasks that at first appear to be similar to our problem, but turn out to have very different solutions. When dealing with multilingual corpora it is often useful to determine the language used in a document. A few sentences are usually sufficient for this task ([1]), so that it is even possible to reliably determine the language used in short paragraphs. For document retrieval it is important to cluster text segments containing similar concepts rather than different languages, and good solutions have been proposed ([4], [3]). On a lower level, people are interested in splitting words into morphemes, the smallest meaning-bearing elements of language. For agglutinative languages like Finnish and Turkish, the unsupervised construction of lexica is infeasible due to the large number of different word forms, and therefore morphemes have to be used [5].

The techniques used in these text segmentation tasks cannot be applied in our setting. We need to be able to determine the language not for a whole paragraph but for single words. In the language classification task, evidence from a whole paragraph can be collected, and it is known where a transition between languages can occur. In our case a transition can occur between each pair of words, and getting the boundaries between languages exactly right is crucial. Therefore our task turns out to be more closely related to work in completely different domains.
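The document-level language identification task mentioned above can be illustrated with a standard character-trigram approach; the training snippets and smoothing below are invented placeholders, not data or methods from this paper.

```python
# Toy document-level language identification with character trigrams.
# Training snippets and add-one smoothing are illustrative assumptions.
import math
from collections import Counter

def trigram_counts(text):
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def log_likelihood(text, counts, total, vocab_size):
    # add-one smoothed log-likelihood of the text under one language model
    return sum(math.log((counts.get(text[i:i + 3], 0) + 1)
                        / (total + vocab_size))
               for i in range(len(text) - 2))

def identify(text, models):
    """Pick the language whose trigram model best explains the text."""
    vocab_size = len({g for counts in models.values() for g in counts})
    totals = {lang: sum(c.values()) for lang, c in models.items()}
    return max(models,
               key=lambda lang: log_likelihood(text, models[lang],
                                               totals[lang], vocab_size))
```

The accumulation of evidence over an entire passage is precisely what makes this task easy; a single word contributes only a handful of trigrams, which is why such techniques do not transfer to our word-level setting.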
When trying to find the secondary structure of a protein [2] or trying to determine the borders between coding and non-coding DNA [10], a transition can occur after any element.

Note: The initial idea was to compare a clustering method working on the character level with a more sophisticated model working on the word level. The word-level model was not implemented because the data set turned out to be smaller than expected, so that sparsity would have made parameter estimation very hard.

This paper is structured as follows: after describing the data set in the next section, we will introduce a Naive Bayes classifier and the corresponding graphical model in section three. This classifier will be trained with Expectation Maximization in section four. Deterministic Annealing as an