Spelling Checker-based Language Identification for the Eleven Official South African Languages

Wikus Pienaar, Dirk Snyman
Centre for Text Technology (CTexT®)
North-West University
Potchefstroom, South Africa
{Wikus.Pienaar;Dirk.Snyman}@nwu.ac.za

Abstract—Language identification is often the first step when compiling corpora from web pages or other unstructured sources. In this paper, an effective and accurate method for the identification of all eleven official South African languages is presented. The method is based on reusing commercial spelling checkers and consists of a multi-stage architecture that is described in detail. We describe the implementation of our method, as well as an optimisation technique that was applied to reduce the processing time of the language identifier. Evaluation results indicate significant improvements over the accuracy figures obtained by a baseline system.

Keywords—Language identification; spelling checkers; African languages.

I. INTRODUCTION

The goal of a language identifier is to classify a text based on the language it is written in [1]. In a multilingual environment like South Africa, with eleven official languages, some form of language identification is often needed when searching for new texts to include in corpora, or when analysing texts for various natural language processing tasks. Stemming, grammatical analysis, POS tagging, and a number of other tasks require language identification as a prerequisite, because these tasks apply various language-specific algorithms [2]. Language identification is also a very important process when gathering parallel corpora for training statistical machine translation systems. In an effort to obtain parallel corpora for the development of machine translation systems for three South African language pairs in the Autshumato project [3], we used a web crawler to gather parallel texts from the websites of the South African government.
These sites provide newsletters, circulars, forms, and other useful information regarding local and national government to citizens in all eleven official languages. A problem that we encountered during the processing of the retrieved data is that the language of a file or document can often not be determined on the basis of the file name alone. This is because the naming convention is not applied consistently: the language indication in the file name may be inconsistent, absent, or erroneous, which leaves the language of the file unclear. Furthermore, a text file may contain sentences from different languages. This complicates the language identification process, making it more time consuming and expensive with regard to human resources, because each file needs to be individually checked and classified by hand as containing a certain language or, in some cases, more than one language. We therefore propose a system that reuses existing resources to automate this language identification process. Automating this classification task saves both time and money: it reduces the amount of expensive human input, and reusing existing resources eliminates the need to develop new, costly resources.

II. RELATED WORK

A. Approaches to Language Identification

Padró and Padró [4] distinguish two main approaches to language identification: identification based on linguistic information, and statistically based methods. These approaches are briefly discussed below. Language identification based on linguistic information can be performed by analysing special characters and diacritics, or letter sequences that are very frequent in a specific language [4]. These features can be obtained from corpus-based examples.
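As a minimal sketch of the linguistic-information approach, the snippet below classifies a text by looking for diacritics and special characters that are characteristic of a language. The cue-character sets are illustrative assumptions for this sketch, not taken from the paper or from any reference corpus.

```python
# Sketch: language identification from diacritics/special characters.
# The cue sets below are illustrative assumptions, not from the paper.
DIACRITIC_CUES = {
    "afr": set("êëèéôûîï"),  # Afrikaans: circumflex/diaeresis vowels
    "tsn": set("š"),         # Setswana: s with caron
    "nso": set("š"),         # Sepedi: s with caron (shared with Setswana)
}

def identify_by_diacritics(text: str) -> list[str]:
    """Return every language whose cue characters occur in the text."""
    found = set(text.lower())
    return sorted(lang for lang, cues in DIACRITIC_CUES.items()
                  if cues & found)

print(identify_by_diacritics("Dumela, o tšwa kae?"))  # ['nso', 'tsn']
print(identify_by_diacritics("hello world"))          # []
```

Note that the sketch already exposes the two weaknesses discussed next: a text without any cue characters matches nothing, and languages that share the same characters (here Setswana and Sepedi) cannot be disambiguated.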
The approach based on diacritics and special characters proves problematic for our specific identification needs, because languages such as English and isiZulu do not contain any diacritics or special characters. This could cause confusion in a system based on this approach: the absence of these characters would point to both English and isiZulu, without any means of disambiguation. The same holds true for other languages in which these characters are absent, and for closely related languages that share the same characters (such as Setswana and Sepedi, which both contain the š character). Dunning [5] points out that the letter-sequence approach depends on only a few unique occurrences of letters and disregards any other information in the text that could contribute to the classification. The approach holds intuitive merit, but this merit alone is not sufficient to determine the outcome of a classification. Neither Padró and Padró [4] nor Dunning [5] report results for a language identification system based on linguistic information.
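By contrast, the spelling-checker idea outlined in the abstract can be sketched as follows: run the text through a word list per language and choose the language whose lexicon covers the largest fraction of the tokens. The tiny word lists here are illustrative stand-ins for the commercial spelling checkers the paper actually reuses; the real system's multi-stage architecture is not reproduced.

```python
# Simplified sketch of spelling-checker-based identification.
# The word lists are illustrative stand-ins, not real spell-checker
# lexicons from the paper.
WORD_LISTS = {
    "eng": {"the", "is", "and", "government", "language"},
    "afr": {"die", "is", "en", "regering", "taal"},
}

def identify_by_spellchecker(text: str) -> str:
    """Pick the language whose word list covers the most tokens."""
    tokens = text.lower().split()

    def coverage(lang: str) -> float:
        return sum(t in WORD_LISTS[lang] for t in tokens) / len(tokens)

    return max(WORD_LISTS, key=coverage)

print(identify_by_spellchecker("die regering en die taal"))  # afr
```

Unlike the diacritic-based sketch, this approach uses every token in the text as evidence, so languages without distinctive characters (such as English and isiZulu) can still be separated, provided adequate lexicons exist for each language.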