Compression-Based Arabic Text Classification
Haneen Ta’amneh, Ehsan Abu Keshek, Manar Bani Issa, Mahmoud Al-Ayyoub, Yaser Jararweh
Jordan University of Science and Technology
Irbid, Jordan
Emails: {hjalitamnah10, eaabukeshek11, mbbanyissa10}@cit.just.edu.jo, {maalshbool, yijararweh}@just.edu.jo
Abstract—Text classification (TC) is one of the fundamental
problems in text mining. Plenty of works exist on TC with
interesting approaches and excellent results; however, most of
these works follow a word-based approach for feature extraction.
In this work, we are interested in an alternative (byte-based
or character-based) approach known as compression-based TC
(CTC). CTC has been used for some languages such as En-
glish and Portuguese, and it has been shown to have certain
advantages/disadvantages compared with word-based approaches. This
work applies CTC to the Arabic language in order to investigate
whether these advantages/disadvantages exist for
Arabic as well. The results are encouraging as they
show the viability of using CTC for Arabic TC.
I. INTRODUCTION
The text classification (TC) problem is concerned with
automatically placing text documents in categories/classes
based on their contents. It is one of the fundamental problems
in many fields such as text mining, machine learning, natural
language processing, information retrieval, etc., with a vast
range of applications such as spam filtering [1], sentiment
analysis [2], [3], [4], [5], determining author’s characteristics
such as identity [6], [7], [8], gender [9], [10], dialect [11],
[12], native language [13], political orientation [14], [15], etc.
The TC problem gained more importance due to the explosion
in the size of text data available on the Web over the past two
decades. Not only has this expansion forced people to consider
scalability issues (giving rise to important fields such as Big
Data), it has also produced special challenges for the TC
problem.
In general TC, we are given a large-enough dataset of
manually labeled training documents and the objective is to
build a classification model based on this dataset capable
of accurately predicting the class of an unlabeled document.
While this description applies to all supervised learning prob-
lems, TC has some special characteristics requiring special
attention. For example, most works on TC start by applying
some text preprocessing tasks followed by employing a word-
based approach for feature extraction. For instance, they may
tokenize each article and apply stemming followed by stop
word removal. Then they may use word occurrences in each
article to build a feature vector for it in what is known as
the bag-of-words (BOW) approach [16]. Such an approach
relies heavily on word-based features (which are language-
dependent) such as tokenization, stemming, etc., and tends to
ignore word order, contextual information and other non-word
features such as punctuation marks and features spanning more
than one word. Moreover, it tends to generate feature vectors
consisting of thousands of features even for relatively small
and restricted datasets. Thus, a feature selection algorithm
has to be applied to determine which features to keep based
on their “discriminating” power. Finally, according to Frank
et al. [17], it has to deal with issues like how to define a
“word,” what to do with numbers and other non-alphabetic
strings, and whether to apply stemming. These issues give rise
to alternative approaches to TC such as compression-based TC
(CTC).
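To make the word-based pipeline concrete, the following minimal sketch illustrates tokenization, stop word removal, and BOW feature-vector construction. The stop-word list, toy documents, and regular-expression tokenizer are illustrative assumptions, not taken from any particular TC system:

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"the", "a", "is", "of", "on", "to"}

def bow_vector(text, vocabulary):
    """Tokenize, drop stop words, and count word occurrences
    against a fixed vocabulary (the bag-of-words model)."""
    tokens = [t for t in re.findall(r"\w+", text.lower())
              if t not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]

# Toy "training" documents standing in for a labeled corpus.
docs = ["the cat sat on the mat", "a dog is a friend of the cat"]

# The vocabulary is every non-stop token seen in training;
# real datasets yield thousands of such features.
vocab = sorted({t for d in docs for t in re.findall(r"\w+", d.lower())
                if t not in STOP_WORDS})
vectors = [bow_vector(d, vocab) for d in docs]
```

Even this toy example shows the approach's traits noted above: word order and punctuation are discarded, and the vector length grows with the vocabulary (stemming, omitted here, would shrink it somewhat).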
According to Marton et al. [18], CTC has been heavily
studied to explore its advantages/disadvantages compared with
traditional word-based TC approaches. Examples of the advan-
tages include the ease of application, the lack of dependence
on the often heavy text preprocessing steps, the ability to cap-
ture non-word features, etc. On the other hand, the disadvan-
tages include the poor performance by some CTC approaches
in terms of accuracy and computational complexity. Most
such works are for the English language. The objective of
this work is to explore these advantages/disadvantages for the
Arabic language. To the best of our knowledge, there have been
no previous works on CTC of Arabic documents, despite the fact
that CTC has been heavily studied for English documents over
the past three or four decades [17]. This work is especially
important to draw attention to CTC as an alternative option to
perform various TC tasks such as spam filtering, authorship
authentication, language/dialect identification, etc. Taking into
consideration the significance of TC and the reliance of the
traditional TC algorithms on language-dependent tools which
do not perform on Arabic text as well as they perform on
English text, one would see the need to explore language
independent options.
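As a language-independent illustration of the CTC idea, one common formulation assigns a document to the class whose training corpus "compresses it best," i.e., whose compressed size grows least when the document is appended. The sketch below uses this formulation with Python's zlib; the compressor choice, toy corpora, and test sentence are our illustrative assumptions and do not necessarily reflect the specific method evaluated in this paper:

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation of data."""
    return len(zlib.compress(data, 9))

def classify(document: str, class_corpora: dict) -> str:
    """Assign the document to the class whose corpus minimizes the
    *extra* bytes needed to compress corpus+document vs. corpus alone.
    Note that no tokenization, stemming, or stop word removal is needed."""
    doc = document.encode("utf-8")
    best_label, best_cost = None, float("inf")
    for label, corpus in class_corpora.items():
        base = corpus.encode("utf-8")
        cost = compressed_size(base + doc) - compressed_size(base)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Toy per-class training corpora for illustration.
corpora = {
    "sports": "match goal team score player win league game " * 20,
    "finance": "market stock price trade bank profit share fund " * 20,
}
label = classify("the team scored a late goal to win the match", corpora)
```

Because the classifier operates on raw bytes, the same code applies unchanged to Arabic text, which is precisely the language-independence argument made above.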
The language of choice in this work is Arabic. Most
existing works on NLP in general, from text processing
tools to optimized classifiers, target English. Arabic, on the
other hand, is largely understudied despite being one of the six
official languages of the UN and the native language of 420
million people living in the Arab world, which spans regions of
the Middle East and North Africa (MENA) in addition to parts
of East Africa (Horn of Africa) [19]. Moreover, the amount
of Arabic content on the Web and the number of Arabic
speaking users are growing rapidly [5], [20]. Finally, Arabic is
a rich morphological language with many challenging aspects.
The importance of the Arabic language and the interesting
challenges associated with studying it make it one of the most
appealing languages to study.
The rest of this paper is organized as follows. In Section II,
978-1-4799-7100-8/14/$31.00 ©2014 IEEE