Journal of Computer Science 10 (11): 2232-2239, 2014
ISSN: 1549-3636
© 2014 A. Adel et al., This open access article is distributed under a Creative Commons Attribution
(CC-BY) 3.0 license
doi:10.3844/jcssp.2014.2232.2239 Published Online 10 (11) 2014 (http://www.thescipub.com/jcs.toc)
Corresponding Author: Nazlia Omar, Knowledge Technology Group, Centre for AI Technology,
Faculty of Information Science and Technology, University Kebangsaan Malaysia,
43600 Bangi, Selangor, Malaysia
A COMPARATIVE STUDY OF COMBINED FEATURE
SELECTION METHODS FOR ARABIC TEXT
CLASSIFICATION
Aisha Adel, Nazlia Omar and Adel Al-Shabi
Knowledge Technology Group, Centre for AI Technology, Faculty of Information Science and Technology,
University Kebangsaan Malaysia, 43600 Bangi, Selangor, Malaysia
Received 2014-04-09; Revised 2014-10-15; Accepted 2014-11-11
ABSTRACT
Text classification is an important task because of the huge volume of electronic documents. One of its
main problems is the high dimensionality of the feature space. Researchers have proposed many
algorithms to select relevant features from text. These algorithms have been studied extensively for English text,
while studies for Arabic are still limited. This study investigates the performance of five
widely used feature selection methods, namely Chi-square, Correlation, GSS Coefficient, Information Gain and
Relief F. In addition, it introduces an approach that combines feature selection methods based
on the average weight of the features. The experiments are conducted using Naïve Bayes and Support Vector
Machine classifiers to classify a published Arabic corpus. The results show that, among the individual
methods, the best results were obtained with Information Gain. The results also show that the combination
of multiple feature selection methods outperforms the best results obtained by the individual methods.
Keywords: Feature Selection, Combination Method, Arabic Text Classification
1. INTRODUCTION
With the rapid growth of the Internet, the volume of
news and information available on the web is growing
exponentially, which makes analyzing and processing
this material manually a very difficult task. As a
consequence, text classification has gained importance
in the hierarchical organization of these documents.
The fundamental goal of text classification is to
classify texts into appropriate classes.
One of the problems of text classification is the huge
number of features, which reduces classification
performance and increases processing time. Feature
selection is used to reduce the feature space by selecting
the most relevant features (Maldonado and L’Huillier,
2013). Many feature selection methods have been
proposed and investigated to improve the performance of
English text classification. However, work on feature
selection for the Arabic language is limited, and most
studies of Arabic text classification are
concerned with the efficiency of text
classification algorithms, paying little attention to
how feature selection can improve classification
accuracy (Al-Salemi and Ab Aziz, 2010;
Hawashin et al., 2013; Saad, 2011).
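As a rough illustration of how feature selection shrinks the feature space, and of how scores from several methods might be combined by average weight, consider the minimal Python sketch below. It is not the authors' implementation. The function names, the toy terms and their scores are hypothetical, and the min-max normalization step is an assumption made here so that scores on different scales can be averaged.

```python
def min_max_normalize(scores):
    """Rescale one method's scores to [0, 1] (assumed step, so scales are comparable)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: no term is preferred by this method.
        return {term: 0.0 for term in scores}
    return {term: (s - lo) / (hi - lo) for term, s in scores.items()}


def average_weights(score_tables):
    """Average the normalized weight of each term over all selection methods."""
    normalized = [min_max_normalize(table) for table in score_tables]
    terms = set().union(*normalized)
    return {
        term: sum(table.get(term, 0.0) for table in normalized) / len(normalized)
        for term in terms
    }


def top_k(weights, k):
    """Keep the k highest-weighted terms (ties broken alphabetically)."""
    return sorted(weights, key=lambda term: (-weights[term], term))[:k]


# Hypothetical scores for four terms from two selection methods.
chi_square_scores = {"goal": 8.0, "match": 6.0, "the": 0.0, "bank": 2.0}
info_gain_scores = {"goal": 0.30, "match": 0.10, "the": 0.00, "bank": 0.20}

combined = average_weights([chi_square_scores, info_gain_scores])
selected = top_k(combined, 2)
print(selected)
```

In this toy case the two methods disagree on the ranking of "match" and "bank", and the averaged weight settles the order. Only the selected terms would then be passed on to the classifier.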
Our motivation for this research is to enhance the
robustness of the feature subsets finally selected for each class
and to remove noisy and redundant features, because
there is another subset which supplies the same information