A Comparison of Text-Classification Techniques Applied to Arabic Text

Ghassan Kanaan and Riyad Al-Shalabi
Arab Academy for Banking and Financial Services, Amman, Jordan. E-mail: {ghkanaan, rshalabi}@aabfs.org

Sameh Ghwanmeh
Computer Engineering Department, Yarmouk University, Jordan. E-mail: sameh@yu.edu.jo

Hamda Al-Ma’adeed
Arab Academy for Banking and Financial Services, Amman, Jordan. E-mail: hamda.almaadeed@gmail.com

Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. The nature of Arabic text differs from that of English text, and preprocessing of Arabic text is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories has been automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The research results reveal that naïve Bayes was the best performer, followed by kNN and Rocchio.

Introduction

Text classification (TC, also known as text categorization or topic spotting) is the task of deciding whether a piece of text belongs to any of a set of prescribed classes. The task dates back at least to the 1960s. Falling at the crossroads of information retrieval (IR) and machine learning (ML), it has witnessed huge interest in the last 10 years from researchers and developers alike (Sebastiani, 2005).

With the amount of online information growing rapidly, the need for reliable automatic text categorization has increased. Text classification techniques are used, for example, to build personalized net news filters, which learn about the news-reading preferences of a user.
They are used to index news stories or guide a user’s search on the World Wide Web (Joachims, 1997; Lewis, 1991; Sebastiani, 2002). To facilitate the process of text classification, automatic classification schemes are required. The goal of text classification is to learn classification schemes that can be used to classify text documents automatically (Guo, Wang, Bell, Bi, & Greer, 2004).

Many algorithms have been implemented for the problem of text classification. Most of the work in this area was carried out for English text. Very little research has been carried out on Arabic text. Arabic text differs in nature from English text, and preprocessing of Arabic text is more challenging. This paper aims to compare three different classification techniques on Arabic text: the k-nearest neighbor (example-based), Rocchio (profile-based), and naïve Bayes (parametric-based) classifiers. For the first two techniques, different weighting schemes will be compared in an attempt to find the most efficient combination of technique and weighting scheme.

Received October 22, 2007; revised December 25, 2007; accepted December 27, 2007
© 2009 ASIS&T • Published online 6 July 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20832

Text Classification

Text classification may be formalized as the task of approximating the unknown target function $\breve{\Phi}: D \times C \to \{T, F\}$ (which describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function $\Phi: D \times C \to \{T, F\}$ called the classifier, where $C = \{c_1, \ldots, c_{|C|}\}$ is a predefined set of categories and $D$ is a (possibly infinite) set of documents. Depending on the application, TC may be either a single-label task (i.e., exactly one $c_i \in C$ must be assigned to each $d_j \in D$) or a multilabel task (i.e., any number $0 \le n_j \le |C|$ of categories may be assigned to a document $d_j \in D$).
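As an informal illustration of this formalization, the following Python sketch treats a classifier as a Boolean-valued function over document-category pairs and checks whether it satisfies the single-label constraint on a given set of documents. The category names, cue words, and the functions `keyword_classifier`, `assigned_categories`, and `is_single_label` are hypothetical and introduced only for this example; they are not part of the techniques or the corpus evaluated in this paper.

```python
from typing import Callable, Dict, List, Set

Document = str
Category = str
# A classifier in the sense of the text: a function D x C -> {True, False}.
Classifier = Callable[[Document, Category], bool]

# Hypothetical categories and cue words, chosen only for illustration.
CUE_WORDS: Dict[Category, Set[str]] = {
    "sports": {"match", "goal"},
    "economy": {"bank", "market"},
}


def keyword_classifier(doc: Document, category: Category) -> bool:
    """Toy classifier: True if the document contains a cue word of the category."""
    return any(token in CUE_WORDS[category] for token in doc.lower().split())


def assigned_categories(
    phi: Classifier, doc: Document, categories: List[Category]
) -> List[Category]:
    """Return every category c for which phi(doc, c) is True."""
    return [c for c in categories if phi(doc, c)]


def is_single_label(
    phi: Classifier, docs: List[Document], categories: List[Category]
) -> bool:
    """True if phi assigns exactly one category to each document in docs."""
    return all(len(assigned_categories(phi, d, categories)) == 1 for d in docs)


categories = list(CUE_WORDS)
doc_a = "The bank raised rates"          # matches one category
doc_b = "Goal scored as market fell"     # matches two categories

print(assigned_categories(keyword_classifier, doc_a, categories))
print(assigned_categories(keyword_classifier, doc_b, categories))
print(is_single_label(keyword_classifier, [doc_a, doc_b], categories))
```

In this sketch the second document receives two categories, so the toy classifier is admissible only under the multilabel reading, where any number of categories between zero and |C| may be assigned to a document.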
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(9):1836–1844, 2009

A special case of single-label TC is binary TC, in which, given a category $c_i$, each $d_j \in D$ must