A Comparison of Text-Classification Techniques Applied to Arabic Text

Ghassan Kanaan and Riyad Al-Shalabi
Arab Academy for Banking and Financial Services, Amman, Jordan. E-mail: {ghkanaan, rshalabi}@aabfs.org

Sameh Ghwanmeh
Computer Engineering Department, Yarmouk University, Jordan. E-mail: sameh@yu.edu.jo

Hamda Al-Ma’adeed
Arab Academy for Banking and Financial Services, Amman, Jordan. E-mail: hamda.almaadeed@gmail.com

Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. The nature of Arabic text differs from that of English text, and preprocessing of Arabic text is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories has been automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The research results reveal that naïve Bayes was the best performer, followed by kNN and Rocchio.

Introduction

Text classification (TC), also known as text categorization or topic spotting, is the task of deciding whether a piece of text belongs to any of a set of prescribed classes. It dates back at least to the 1960s. This task, which falls at the crossroads of information retrieval (IR) and machine learning (ML), has witnessed huge interest in the last 10 years from researchers and developers alike (Sebastiani, 2005). With the amount of online information growing rapidly, the need for reliable automatic text categorization has increased. Text classification techniques are used, for example, to build personalized net news filters, which learn about the news-reading preferences of a user.
They are used to index news stories or guide a user’s search on the World Wide Web (Joachims, 1997; Lewis, 1991; Sebastiani, 2002). To facilitate the process of text classification, automatic classification schemes are required. The goal of text classification is to learn classification schemes that can be used to classify text documents automatically (Guo, Wang, Bell, Bi, & Greer, 2004).

Received October 22, 2007; revised December 25, 2007; accepted December 27, 2007
© 2009 ASIS&T. Published online 6 July 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20832

Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. Arabic text differs in nature from English text, and preprocessing of Arabic text is more challenging. This paper aims to compare three different classification techniques on Arabic text: k-nearest neighbor (example-based), Rocchio (profile-based), and naïve Bayes (parametric-based) classifiers. For the first two techniques, different weighting schemes are compared in an attempt to find the most efficient combination of technique and weighting scheme.

Text Classification

Text classification may be formalized as the task of approximating the unknown target function $\breve{\Phi} : D \times C \to \{T, F\}$ (which describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function $\Phi : D \times C \to \{T, F\}$ called the classifier, where $C = \{c_1, \ldots, c_{|C|}\}$ is a predefined set of categories and $D$ is a (possibly infinite) set of documents. Depending on the application, TC may be either a single-label task (i.e., exactly one $c_i \in C$ must be assigned to each $d_j \in D$), or a multilabel task (i.e., any number $0 \le n_j \le |C|$ of categories may be assigned to a document $d_j \in D$).
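The formalization above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the category names, the keyword sets, and the helper names `overlap`, `phi`, `multilabel`, and `single_label` are all hypothetical, and a simple keyword-overlap rule stands in for a learned classifier. The sketch only shows how a function from document-category pairs to {T, F} gives rise to multilabel and single-label assignment.

```python
from typing import Callable, Dict, List, Set

Document = str
Category = str
# A classifier in the formal sense: maps a (document, category) pair to True/False.
Classifier = Callable[[Document, Category], bool]

# Hypothetical keyword sets standing in for a trained model (assumption, not from the paper).
KEYWORDS: Dict[Category, Set[str]] = {
    "sports": {"match", "team", "goal"},
    "economics": {"bank", "market", "price"},
}


def overlap(doc: Document, category: Category) -> int:
    """Count how many of the category's keywords occur in the document."""
    return len(set(doc.lower().split()) & KEYWORDS[category])


def phi(doc: Document, category: Category) -> bool:
    """Toy classifier: True if the document shares at least one keyword with the category."""
    return overlap(doc, category) > 0


def multilabel(doc: Document, categories: List[Category],
               classifier: Classifier = phi) -> List[Category]:
    """Multilabel task: return every category the classifier accepts (possibly none)."""
    return [c for c in categories if classifier(doc, c)]


def single_label(doc: Document, categories: List[Category]) -> Category:
    """Single-label task: return exactly one category, here the one with the largest overlap.

    Ties are broken in favour of the category listed first.
    """
    return max(categories, key=lambda c: overlap(doc, c))


if __name__ == "__main__":
    cats = ["sports", "economics"]
    print(multilabel("The team signed a deal with the bank", cats))
    print(single_label("The team scored a goal", cats))
```

In the multilabel case the number of returned categories can be anything from 0 to |C|, whereas the single-label function always returns exactly one category, even for a document that matches none of the keyword sets.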
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(9):1836–1844, 2009

A special case of single-label TC is binary TC, in which, given a category $c_i$, each $d_j \in D$ must