A Comparison of Text-Classification Techniques
Applied to Arabic Text
Ghassan Kanaan and Riyad Al-Shalabi
Arab Academy for Banking and Financial Services, Amman, Jordan.
E-mail: {ghkanaan, rshalabi}@aabfs.org
Sameh Ghwanmeh
Computer Engineering Department, Yarmouk University, Jordan. E-mail: sameh@yu.edu.jo
Hamda Al-Ma’adeed
Arab Academy for Banking and Financial Services, Amman, Jordan.
E-mail: hamda.almaadeed@gmail.com
Many algorithms have been implemented for the problem of text classification. Most of the work in this area has been carried out on English text; very little research has addressed Arabic text. Arabic text differs in nature from English text, and preprocessing it is more challenging. This paper presents an implementation of three automatic text-classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories was automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The results show that naïve Bayes was the best performer, followed by kNN and Rocchio.
Introduction
Text classification (TC), also known as text categorization or topic spotting, is the task of deciding whether a piece of text belongs to any of a set of prescribed classes. The task dates back at least to the 1960s. Falling at the crossroads of information retrieval (IR) and machine learning (ML), it has attracted strong interest over the last 10 years from researchers and developers alike (Sebastiani, 2005). With the amount of online information growing rapidly, the need for reliable automatic text categorization has increased. Text classification techniques are used, for example, to build personalized net news filters, which learn about the news-reading preferences of a user. They are also used to index news stories or
guide a user’s search on the World Wide Web (Joachims, 1997; Lewis, 1991; Sebastiani, 2002). To facilitate the process of text classification, automatic classification schemes are required. The goal of text classification is to learn classification schemes that can be used to classify text documents automatically (Guo, Wang, Bell, Bi, & Greer, 2004).
Received October 22, 2007; revised December 25, 2007; accepted December 27, 2007
© 2009 ASIS&T • Published online 6 July 2009 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/asi.20832
Many algorithms have been implemented for the problem of text classification. Most of the work in this area has been carried out on English text; very little research has addressed Arabic text. Arabic text differs in nature from English text, and preprocessing it is more challenging. This paper compares three classification techniques on Arabic text: the k-nearest-neighbor (example-based), Rocchio (profile-based), and naïve Bayes (parametric-based) classifiers. For the first two techniques, different weighting schemes are compared in an attempt to find the most efficient combination of technique and weighting scheme.
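To make the distinction between the three families concrete, the following is a minimal Python sketch, not the implementation used in this paper. It assumes a hypothetical four-document toy training set with English placeholder tokens, raw term frequencies as weights (rather than the weighting schemes compared later), a simplified Rocchio variant that uses only the positive class centroid as the profile, and a multinomial naïve Bayes model with add-one smoothing. All function and variable names are illustrative.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set (tokens, category); not the paper's corpus.
TRAIN = [
    ("bank loan interest bank".split(), "economics"),
    ("market stock bank price".split(), "economics"),
    ("match goal team player".split(), "sports"),
    ("team coach goal league".split(), "sports"),
]

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn(tokens, k=3):
    """Example-based: majority vote among the k most similar training documents."""
    q = Counter(tokens)
    ranked = sorted(TRAIN, key=lambda ex: cosine(q, Counter(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def rocchio(tokens):
    """Profile-based: compare the document with one summed profile per class.
    Assumption: positive examples only (no negative-example term)."""
    profiles = defaultdict(Counter)
    for doc, label in TRAIN:
        profiles[label].update(doc)
    q = Counter(tokens)
    return max(profiles, key=lambda c: cosine(q, profiles[c]))

def naive_bayes(tokens):
    """Parametric: multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocab = {t for doc, _ in TRAIN for t in doc}
    class_docs = Counter(label for _, label in TRAIN)
    term_counts = defaultdict(Counter)
    for doc, label in TRAIN:
        term_counts[label].update(doc)

    def log_posterior(c):
        total = sum(term_counts[c].values())
        score = math.log(class_docs[c] / len(TRAIN))  # log prior
        for t in tokens:
            if t in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        return score

    return max(class_docs, key=log_posterior)
```

On this toy data, all three functions assign `"bank price loan".split()` to `economics` and `"goal team league".split()` to `sports`; on a real corpus the three approaches can disagree, which is what the comparison in this paper measures.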
Text Classification
Text classification may be formalized as the task of approximating the unknown target function $\Phi : D \times C \rightarrow \{T, F\}$ (which describes how documents ought to be classified, according to a supposedly authoritative expert) by means of a function $\hat{\Phi} : D \times C \rightarrow \{T, F\}$ called the classifier, where $C = \{c_1, \ldots, c_{|C|}\}$ is a predefined set of categories and $D$ is a (possibly infinite) set of documents. Depending on the application, TC may be either a single-label task (i.e., exactly one $c_i \in C$ must be assigned to each $d_j \in D$), or a multilabel task (i.e., any number $0 \le n_j \le |C|$ of categories may be assigned to a document $d_j \in D$). A special case of single-label TC is binary TC, in which, given a category $c_i$, each $d_j \in D$ must
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 60(9):1836–1844, 2009