Data-Driven Part-of-Speech Tagging of Kiswahili

Guy De Pauw¹, Gilles-Maurice de Schryver²,³, and Peter W. Wagacha⁴

¹ CNTS - Language Technology Group, University of Antwerp, Belgium
² African Languages and Cultures, Ghent University, Belgium
³ Xhosa Department, University of the Western Cape, South Africa
⁴ School of Computing and Informatics, University of Nairobi, Kenya

Abstract. In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Using four of the current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find the last of these, MXPOST, to be the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.

1 Introduction

It is well known that Part-of-Speech (POS) taggers are crucial components in the development of any serious application in the fields of Computational Linguistics (CL), Natural Language Processing (NLP) or Human Language Technology (HLT). While great strides have been made for (major) Indo-European languages such as English, Dutch and German, work on the Bantu languages is scarcely out of the egg. The Bantu languages, of which there are roughly five to six hundred, are basically agglutinating in nature, are characterized by a nominal class system and concordial agreement, and are spoken from an imaginary line north of the Democratic Republic of the Congo all the way down to the southern tip of the African continent.
A particularly active region with regard to work on POS taggers for the Bantu languages is South(ern) Africa, but so far the projects have unfortunately not gone much beyond the development of (proposed) tagsets and, in some cases, prototype modules for morphological analysis. In this regard, the EAGLES tagset was adjusted for Setswana [1], a different tagset and suggestions to venture into Transformation-Based Tagging were presented for isiXhosa [2], yet another tagset and a combination of rule-based symbolic tagging and statistical tagging were offered as a corpus-processing tool for Sesotho sa Leboa [3,4], and a prototype finite-state morphological analyzer was developed for isiZulu [5,4]. For Kiswahili, a Bantu language spoken by up to fifty million people in East Africa (which makes it one of the most widely spoken African languages), the situation is markedly different. Close to two decades of work at the University of Helsinki resulted in a

Petr Sojka, Ivan Kopeček and Karel Pala (Eds.): TSD 2006, LNAI 4188, pp. 197–204, 2006. © Springer-Verlag Berlin Heidelberg 2006