S.Afr.J.Afr.Lang., 2004, 1 57

Spellcheckers for the South African languages, Part 1: The status quo and options for improvement

Gilles-Maurice de Schryver *
Department of African Languages and Cultures, Ghent University, Rozier 44, B-9000 Ghent, Belgium
Department of African Languages, University of Pretoria, Pretoria 0002, South Africa
E-mail: gillesmaurice.deschryver@UGent.be

DJ Prinsloo
Department of African Languages, University of Pretoria, Pretoria 0002, South Africa
E-mail: danie.prinsloo@up.ac.za

May 2003

In this article an annotated diachronic overview of the field of spelling and grammar checkers is presented, with specific reference to the underlying computational techniques. Where appropriate, the various methods are illustrated with data drawn from the official South African languages. The performance of the current South African spellcheckers is subsequently studied, which leads to the conclusion that improvements are needed, especially for the Nguni group. Various options for achieving such improvements are then explored. Most illustrations and calculations are carried out on an authentic set of parallel texts entitled “What is the African National Congress?” (ANC, [sa]).

Spellcheckers for the South African languages: Genesis and beyond

Spellcheckers for the South African languages were first developed by D.J. Prinsloo at the end of the 1990s, and proofing tools containing components for isiXhosa, isiZulu, Sesotho sa Leboa and Setswana were made available in WordPerfect 9, within the WordPerfect Office 2000 suite. Each of those spellcheckers consisted of no more than around thirty thousand top-frequency orthographic words (Prinsloo and De Schryver, 2001:129). In 2003, corpus-based spellcheckers, commissioned by the Department of Arts and Culture (DAC), were released for all official South African languages except (South African) English (Prinsloo and De Schryver, 2003a).
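A wordlist-based spellchecker of the kind described above reduces, at its core, to a membership test: every orthographic word in a text is looked up in the list, and anything not found is flagged. The following is a minimal sketch in Python, not the implementation of any of the spellcheckers mentioned. The tokenizer, the function names and the six-word toy list are assumptions made purely for illustration; a real lexicon would hold tens or hundreds of thousands of entries.

```python
import re

# Hypothetical toy wordlist; a placeholder, not an actual spellchecker lexicon.
WORDLIST = {"what", "is", "the", "african", "national", "congress"}


def tokenize(text):
    """Split text into lowercase orthographic words.

    Assumption: a word is a run of letters, optionally joined by
    internal hyphens or apostrophes. Real tokenization may differ.
    """
    return re.findall(r"[^\W\d_]+(?:[-'][^\W\d_]+)*", text.lower())


def flag_unknown(text, wordlist):
    """Return the tokens that are not in the wordlist, in text order."""
    return [token for token in tokenize(text) if token not in wordlist]


def recognised_share(text, wordlist):
    """Return the fraction of tokens in the text found in the wordlist."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in wordlist for token in tokens) / len(tokens)


if __name__ == "__main__":
    sample = "What is the Afrikan National Congress?"
    print(flag_unknown(sample, WORDLIST))
    print(recognised_share(sample, WORDLIST))
```

Under this scheme a valid word that happens to be missing from the list is flagged just like a genuine misspelling, which is why the size and composition of the wordlist matter so much.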
In this second attempt the wordlists are typically several hundred thousand words long, and can simply be loaded as ‘custom dictionaries’ into commercial word processors such as Microsoft Word. In Prinsloo and De Schryver (2003b) it is shown that wordlists consisting of a few hundred thousand valid orthographic words successfully push the recall of correctly written text – the so-called lexical recall – up to 99% for the disjunctively written African languages (Sesotho sa Leboa, Sesotho, Setswana, Xitsonga and Tshivenda), 1 as well as for Afrikaans. Non-words are thus rather easily picked up for these languages, with only occasional false alarms in which valid words are also flagged as misspellings. When this same approach is followed for the conjunctively written African languages (i.e. the Nguni group: isiXhosa, isiZulu, isiNdebele and siSwati), however, even lexica containing up to half a million orthographic words do not result in lexical recall values higher than 90%.

* Author to whom correspondence should be addressed.

In the present article, the first in a series of two, the scene is set for the use of so-called ‘clusters of circumfixes’ (the topic of Part 2, cf. Prinsloo and De Schryver, 2004), which are aimed at improving the lexical recall of the spellcheckers, especially those for the Nguni group. In order to fully appreciate the need for such clusters of circumfixes, the present article (Part 1) consists of three main sections laying the groundwork. Firstly, a diachronic overview of the various algorithms and techniques that