54 International Journal of Information Retrieval Research, 1(3), 54-70, July-September 2011
Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Keywords: Arabic Text Classification, Decision Tree, Naïve Bayes Classifier (NB), Natural Language
Processing, Stemming, Support Vector Machine (SVM), Text Classification
1. INTRODUCTION
The tremendous growth of Arabic text
documents available on the Web and in databases
has posed a major challenge to researchers: finding
better ways to deal with such a huge amount of
information, so that search engines
and information retrieval systems can provide
relevant information accurately. Doing so has
become a crucial task in satisfying the needs of
different end users.
Text classification and its techniques
have become a major tool for dealing with
the large amount of data available on the Web
and in databases. Text classification is the task of
automatically assigning text documents to one
or more predefined categories based on content
and linguistic features (Gharib et al., 2009;
The Effect of Stemming on
Arabic Text Classification:
An Empirical Study
Abdullah Wahbeh, Yarmouk University, Jordan
Mohammed Al-Kabi, Yarmouk University, Jordan
Qasem Al-Radaideh, Yarmouk University, Jordan
Emad Al-Shawakfa, Yarmouk University, Jordan
Izzat Alsmadi, Yarmouk University, Jordan
ABSTRACT
The information world is rich in documents in different formats or applications, such as databases, digital
libraries, and the Web. Text classification supports the search functionality offered by search engines and
information retrieval systems in dealing with the large number of documents on the Web. Many studies
in the field of text classification have addressed English, Dutch, Chinese, and other languages,
whereas fewer have addressed the Arabic language. This paper addresses the issue of automatic
classification of Arabic text documents. It applies text classification to Arabic text documents, using
stemming as one of the preprocessing steps. The results show that, without stemming, the support vector
machine (SVM) classifier achieved the highest classification accuracy under the two test modes, with 87.79%
and 88.54%. Stemming, on the other hand, negatively affected the accuracy: the SVM accuracy under the two
test modes dropped to 84.49% and 86.35%.
DOI: 10.4018/ijirr.2011070104