Finite-State Computational Morphology - Treatment of the Zulu Noun

L. Pretorius (a), S. E. Bosch (b)

(a) Department of Computer Science and Information Systems, University of South Africa, Pretoria, South Africa, pretol@unisa.ac.za
(b) Department of African Languages, University of South Africa, Pretoria, South Africa, boschse@unisa.ac.za

Abstract

Morphological analysis is a basic enabling application for further kinds of natural language processing, including part-of-speech tagging, parsing, translation and other high-level applications. Automated morphological analyzers exist for many of the European languages, but have not been reported for any of the indigenous languages of southern Africa. Our project in computational morphological analysis/generation includes the production of an automated morphological analyzer/generator for Zulu, using finite-state methods and tools. In this paper we elaborate on the use of finite-state methods in computational morphology, and report on our treatment of the Zulu noun.

Keywords: natural language processing, computational morphology, finite-state technology, morphological analysis, agglutinating languages, Zulu, noun

Computing Review Categories: I.2.7

1 Introduction

Advances in research and in the production of sophisticated applications in natural language processing often rely on automated morphological analysis. Such applications include, for example, tokenization, part-of-speech tagging, shallow syntactic parsing, and machine translation [6, 8, 9]. Computational aids for morphological analysis already exist for many European languages, including English, French, German, Spanish, Portuguese and Italian, while significant work has already been done for Basque, Turkish, Arabic, Finnish, Swedish, Norwegian, Danish, several East European languages, for example Hungarian, as well as for Swahili, a member of the Bantu language family¹ (see for example [4]). The status quo according to [8, p.
96] is that morphological analyzers still remain to be written for all but the commercially most important languages. This is also the case for Zulu and the majority of other languages in the Bantu language family, which to date have received little attention in terms of natural language processing. The processing of these languages, which are characterized by complex morphological structures, requires in particular specialized tools for the automatic analysis of word-forms, as well as for most other electronic corpus-based analyses (see for example [14]).

¹ In linguistic studies, "Bantu language family" refers to a specific family of languages spoken in the southern half of the African continent, the individual languages of which share common linguistic features. The term "Bantu" was introduced in language studies as far back as 1857 by a German philologist, W. H. I. Bleek.

Against this background, the aim of this article is to show how the challenges posed by an automated morphological analysis of a language such as Zulu can be addressed within the framework of finite-state methods. For the purposes of this discussion, we restrict ourselves to the morphological analysis of the word category noun in Zulu.

Section 2 of this paper provides a short exposition of the morphological structure of the Zulu noun and the language-specific challenges posed by a computational analysis of this word category. Section 3 gives an overview of some aspects of finite-state methods and tools that are relevant in computational linguistics in general, and in computational morphology in particular. In Section 4 the use of finite-state technology (i.e. methods and tools) in automating the morphological analysis of the Zulu noun is demonstrated by means of a simple example. Finally, we offer concluding remarks and proposals for future research.

2 The morphological structure of the noun in Zulu

The noun in Zulu is made up of two parts, namely a noun prefix and a noun stem. Nouns are typically categorized into eighteen noun classes, as determined