A Semi-Supervised Learning Approach for Morpheme Segmentation for An Arabic Dialect

Mei Yang¹, Jing Zheng², Andreas Kathol²

¹ Dept. of Electrical Engineering, University of Washington, Seattle, WA, USA
² SRI International, Menlo Park, CA, USA

yangmei@u.washington.edu, {zj,kathol}@speech.sri.com

Abstract

We present a semi-supervised learning approach which utilizes a heuristic model for learning morpheme segmentation for Arabic dialects. We evaluate our approach by applying morpheme segmentation to the training data of a statistical machine translation (SMT) system. Experiments show that our approach is less sensitive to the availability of annotated stems than a previous rule-based approach and learns 12% more segmentations on our Iraqi Arabic data. When applied in an SMT system, our approach yields an 8% relative reduction in the training vocabulary size and a 0.8% relative reduction in the out-of-vocabulary (OOV) rate on the test set, again as compared to the rule-based approach. Finally, our approach also results in a modest increase in BLEU scores.

Index Terms: Iraqi Arabic, morpheme segmentation

1. Introduction

Previous studies ([1], [2] and [3]) have shown that the data sparseness problem in translating morphologically complex languages in SMT can be effectively addressed using morphological knowledge. However, morphological analysis, such as tagging, stemming, and morpheme segmentation, is a difficult problem in itself, especially for languages with complex morphology but limited resources. In this paper, we present a semi-supervised learning approach for morpheme segmentation for Arabic dialects, which utilizes a heuristic model with an annotated lexicon. Our goal is to increase the translation accuracy of SMT systems by applying the morpheme segmentation to the training data in order to reduce the number of word forms in the training vocabulary and the OOV rate on the test set.
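
The two quantities targeted above, training vocabulary size and test-set OOV rate, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy segmenter that splits off a single prefix "w", and the placeholder tokens are hypothetical and are neither the segmentation model nor the Iraqi Arabic data used in this paper.

```python
from typing import Callable, Iterable, List, Set

Corpus = List[List[str]]


def vocabulary(sentences: Iterable[List[str]]) -> Set[str]:
    """Return the set of distinct tokens in a tokenized corpus."""
    return {token for sentence in sentences for token in sentence}


def oov_rate(train: Corpus, test: Corpus) -> float:
    """Fraction of running test tokens that never occur in the training data."""
    vocab = vocabulary(train)
    test_tokens = [token for sentence in test for token in sentence]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


def segment_corpus(sentences: Corpus, segmenter: Callable[[str], List[str]]) -> Corpus:
    """Replace every word by the morphemes the segmenter returns for it."""
    return [[m for word in sentence for m in segmenter(word)] for sentence in sentences]


def toy_segmenter(word: str) -> List[str]:
    """Hypothetical stand-in segmenter: split off a leading 'w' from longer words."""
    if word.startswith("w") and len(word) > 3:
        return ["w+", word[1:]]
    return [word]


# Placeholder tokens, not real data.
TRAIN: Corpus = [["wktab", "ktab", "wbyt"], ["byt", "wqlm"]]
TEST: Corpus = [["qlm", "wktab"]]

SEG_TRAIN = segment_corpus(TRAIN, toy_segmenter)
SEG_TEST = segment_corpus(TEST, toy_segmenter)

if __name__ == "__main__":
    print("vocabulary size:", len(vocabulary(TRAIN)), "->", len(vocabulary(SEG_TRAIN)))
    print("OOV rate:", oov_rate(TRAIN, TEST), "->", oov_rate(SEG_TRAIN, SEG_TEST))
```

In this toy setting, splitting off the shared prefix merges word forms that differ only in that prefix, so the training vocabulary shrinks and a test word whose stem was seen only in prefixed form is no longer out of vocabulary.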

Arabic has rich morphology resulting in a large number of word forms. An exhaustive morphological analysis for Arabic usually requires a considerable manual effort, as for instance in the case of the Buckwalter morphological analyzer for Modern Standard Arabic (MSA) [8]. In contrast to MSA, Arabic dialects are spoken languages that rarely appear in written materials. Data can be obtained only by manually transcribing audio recordings, a costly effort that prevents large-scale data collections. Because of the paucity of training resources, the computational analysis of Arabic dialectal morphology has not been developed very extensively and hence derived tools such as lexicons, tokenizers, and taggers are rare or non-existent.

The first author performed the work described while doing an internship at SRI International. We thank Kristin Precoda for helpful comments on an earlier draft of this paper.

2. Related work

Some work has recently been done on the morphological analysis of Arabic dialects. [4] present a morphological analyzer and generator for Arabic dialects (“MAGEAD”), which utilizes a unified processing architecture to generalize morphology to all variants of Arabic. However, since MAGEAD has been developed mainly on the basis of linguistic analysis, adapting MAGEAD for a new Arabic dialect requires a native speaker or trained linguist.

On the other hand, [5] and [6] report that a shallow analyzer of morpheme segmentation for Arabic dialects has proven effective in increasing the accuracy of SMT systems. [5] applies a rule-based approach (see section 3.1 for details) to perform morpheme segmentation for Iraqi Arabic with a pre-compiled list of affixes and stems. In subsequent work [6], a more sophisticated trie model is employed to strip affixes from words. The trie model consists of trie-based classifiers, which are trained for each affix on a small amount of annotated data and are constructed in a cascading manner.
The model parameters are chosen to maximize the accuracy of classifiers on a held-out set with an exhaustive search in the parameter space. Applied to two SMT systems for Iraqi-English and Levantine-English, the trie model outperforms the rule-based model even with little training data, and moreover, further improvement is observed when a hybrid model is used which first uses the rule-based model for segmentation and backs off to the trie model when no analysis can be made. The disadvantage of the trie model lies in the fact that the model parameters might be biased toward the training data. In addition, in contrast to the rule-based model, it is difficult to incorporate linguistic knowledge into the trie model without increasing the number of parameters and the complexity of the model.

Another promising recent approach to developing morphological analyzers for Arabic dialects involves data-sharing techniques. [7] investigate part-of-speech (POS) tagging for Egyptian Arabic using cross-dialectal data sharing. The authors exploit the commonalities among similar dialects and train a contextual model jointly on Egyptian and Levantine data. Their experiments show that the data sharing approach yields the best performance of all methods explored.

3. Morpheme segmentation for Arabic dialects

In this paper, we present a semi-supervised learning approach for morpheme segmentation for Arabic dialects. Our approach adapts the rule-based segmentation model in [5] into a heuristic architecture where various types of resources are incorpo-