Text Classification and Multilinguism: Getting at Words via N-grams of Characters

Ismaïl Biskri & Sylvain Delisle
Université du Québec à Trois Rivières
Département de mathématiques et d'informatique
Trois-Rivières, Québec, Canada, G9A 5H7
www.uqtr.ca/{~biskri, ~delisle}

ABSTRACT

Genuine numerical multilingual text classification is almost impossible if only words are treated as the privileged unit of information. Although text tokenization (in which words are considered as tokens) is relatively easy in English or French, it is much more difficult for other languages such as German or Arabic. Moreover, stemming, typically used to normalize and reduce the size of the lexicon, constitutes another challenge. The notion of N-grams of words (i.e., sequences of N words, with N typically equal to 2, 3, or 4), which for the last ten years seems to have produced good results in both language identification and speech analysis, has recently become a privileged research axis in several areas of knowledge extraction from text. In this paper, we present a text classification software tool based on N-grams of characters (not words), evaluate its results on documents containing text written in English and French, and compare these results with those obtained from a different classification tool based exclusively on the processing of words. An interesting feature of our software is that it does not need to perform any language-specific processing and is thus appropriate for multilingual text classification.

Keywords: numerical text classification, N-grams, (multilingual) natural language processing, knowledge extraction, text databases.

1. TOKENS, WORDS AND CHARACTERS

When processing a large corpus with a statistical tool (see, amongst others, [2] and [16]), the first phase typically consists of subdividing the text into information units called tokens.
These tokens usually correspond to words, at least for the most part; "non-word" tokens may be pictures, numbers, special characters, or symbols. This tokenization process may appear quite simple, not to say trivial (tokenization, morphological analysis, and lexicons are discussed in [17], in the context of corpus-based processing). However, from an automated processing point of view, implementing this process constitutes a challenge. Indeed, how can words be reliably recognized? What are the unambiguous formal surface markers that delineate words, i.e. their boundaries?

These questions are relatively easy to answer for languages such as French or English: basically, any string of characters delimited by a beginning space and an ending space is a simple word. But for many other languages, such as German or Arabic, the answer is much more complicated. For instance, the long token lebensversicherungsgesellschaftsangestellter is a term corresponding to "life insurance company employee": it is a compound noun, not a single word. In Arabic, subject and complement pronouns are sometimes attached (concatenated) to the verb, so that a token like kathabthouhou corresponds in fact to a whole sentence (here, "I wrote it" or "I have written it"). Obviously, the simple notion of tokens defined as strings of characters separated by spaces is an oversimplification that is highly inadequate for many situations and languages (see [13]).

Considering the above, what then could constitute a reasonable atomic unit of information for the segmentation of text, independently of the specific language it is written in? Balpe et al. [1] suggest that this unit should be defined according to the goal we set ourselves when reading or processing a text.
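The contrast between space-delimited tokenization and character-level units can be illustrated with a minimal sketch. The function names below (`whitespace_tokens`, `char_ngrams`) are purely illustrative, not the authors' actual implementation: whitespace splitting treats the German compound above as a single opaque token, while character N-grams decompose any string uniformly, with no language-specific knowledge.

```python
def whitespace_tokens(text):
    """Naive tokenization: any space-delimited string is one token."""
    return text.split()

def char_ngrams(text, n=3):
    """Slide a window of n characters over the text (language-independent)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

compound = "lebensversicherungsgesellschaftsangestellter"
print(whitespace_tokens(compound))   # a single opaque token
print(char_ngrams(compound, 3)[:5])  # ['leb', 'ebe', 'ben', 'ens', 'nsv']
```

The same two functions apply unchanged to English, French, German, or transliterated Arabic text, which is the property the rest of the paper builds on.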
More precisely, from a numerical classification-based knowledge extraction viewpoint, the definition of the basic unit of information to be considered depends on the following:

• The unit of information must be a portion of the input text submitted to the numerical analysis processor (a numerical classifier, as far as we are concerned in this paper);
• From an automated processing point of view, it should be easy to recognize these units of information;
• The definition of the unit of information should be independent of the specific language the text is written in;