Experiments with syllable-based English-Zulu alignment

Gideon Kotzé, Friedel Wolff
University of South Africa
kotzegj@unisa.ac.za, wolfff@unisa.ac.za

Abstract

As a morphologically complex language, Zulu poses notable challenges for alignment with English. One of the biggest concerns for statistical machine translation is that this morphological complexity leads to a large number of words for which very few examples exist in a corpus. To address the problem, we set about establishing an experimental baseline for lexical alignment by naively dividing the Zulu text into syllables, which roughly resemble its morphemes. A small quantitative evaluation and a more thorough qualitative evaluation suggest that our approach has merit, although certain issues remain. Although we have not yet determined the effect of this approach on machine translation, our first experiments suggest that an aligned parallel corpus with reasonable alignment accuracy can be created in as little as a few days for a language pair in which one language is under-resourced. Furthermore, since very little language-specific knowledge was required for this task, our approach can almost certainly be applied to other language pairs and perhaps to other tasks as well.

Keywords: machine translation, morphology, alignment

1. Introduction

Zulu is an agglutinative language in the Bantu language family. It is written conjunctively, which results in words that can contain several morphemes. Verbs are especially prone to complex surface forms. Although word alignment algorithms might have enough information to align all the words in an English text to their Zulu counterparts, the resulting alignment is not very useful for tasks such as machine translation because morphologically complex words are sparse, even in very large texts. This is compounded by the fact that Zulu is a resource-scarce language.
A possible solution to this problem is to analyse each word morphologically and use the resulting analysis to split it into its constituent morphemes. This enables a more fine-grained alignment with better constituent convergence. Since verb prefixes often denote concepts such as subject, object, tense and negation, it would be ideal if they aligned with their (lexical) counterparts in English. Figure 1 shows an example of a Zulu-English alignment before and after the segmentation. Here, it is clear that not only can more alignments be made, but in some cases, such as with "of", convergence is better as well.

Figure 1: An example of an English-Zulu alignment before and after morphological segmentation.

Variations of this strategy have been followed with some success for similar language pairs such as English-Swahili (De Pauw et al., 2011) and English-Turkish (Çakmak et al., 2012).¹

Morphological analysers are, however, difficult and time-consuming to develop, and often relatively language-specific. Although the bootstrapping of morphological analysers between related languages shows promise (Pretorius and Bosch, 2009), in each case the construction of a language-specific lexicon is still required, which is a large amount of work. The Bantu language family is considered resource-scarce, and methods that rely on technologies such as morphological analysers will mostly be out of reach for languages in this family.

We approach this problem by noting that most languages in the Bantu family have a preference for open syllables (Spinner, 2011) and that, in our case, even a simple syllabification approach can roughly approximate morphological segmentation. Hyman (2003) states that the open syllable structure of Proto-Bantu is reinforced by the agglutinative morphology. It is therefore possible to decompose words accurately into syllables in a straightforward way for many Bantu languages.
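As an illustration of how little machinery such a decomposition needs, the following is a minimal sketch of open-syllable splitting, not the implementation used in our experiments. It assumes that every syllable ends at a vowel, so a word is simply cut after each vowel. The names `syllabify` and `syllabify_sentence` are hypothetical, and syllabic consonants and other special cases are not treated.

```python
import re

# Assumption: the five vowel letters mark the end of every (open) syllable.
VOWELS = "aeiouAEIOU"

# Zero or more non-vowel characters followed by exactly one vowel.
_OPEN_SYLLABLE = re.compile(f"[^{VOWELS}]*[{VOWELS}]")


def syllabify(word: str) -> list[str]:
    """Cut a word after every vowel; leftover characters join the last piece."""
    pieces = _OPEN_SYLLABLE.findall(word)
    # Matches are contiguous from the start of the word, so whatever follows
    # the last vowel (e.g. in a foreign name) is the unmatched tail.
    tail = word[sum(len(p) for p in pieces):]
    if tail:
        if pieces:
            pieces[-1] += tail
        else:
            pieces = [tail]  # word without any vowel: keep it whole
    return pieces


def syllabify_sentence(sentence: str) -> str:
    """Replace each whitespace-separated word by its space-separated syllables."""
    return " ".join(" ".join(syllabify(w)) for w in sentence.split())


if __name__ == "__main__":
    print(syllabify("ngiyabonga"))
    print(syllabify_sentence("ngiyabonga kakhulu"))
```

Because the output is plain space-separated tokens, it can be passed to a word aligner in place of the original text. Note that this sketch discards the original word boundaries; keeping them would require an additional marker convention.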
If syllabification is useful for the task of word alignment (or indeed, any other task), it could be applicable to a large number of under-resourced languages. Indeed, some success has been demonstrated by Kettunen (2010) for an information retrieval task in several European languages. As far as we are aware, a syllable-based approach to alignment has not yet been implemented for Bantu languages.²

Figure 2 displays the previous example pair, but with the words split into syllables instead of morphemes. Note that long proper nouns are oversegmented by syllabification in comparison with the corresponding morphological segmentation. Since we have found this approach to work relatively well, we have, for the time being, decided not to segment the English texts morphologically.

¹ Turkish is also agglutinative.
² Some disjunctively written languages such as Northern Sotho (Griesel et al., 2010) and Tswana (Wilken et al., 2012), where the written words resemble syllables, have been involved in machine translation projects.