Compression-Based Arabic Text Classification

Haneen Ta'amneh, Ehsan Abu Keshek, Manar Bani Issa, Mahmoud Al-Ayyoub, Yaser Jararweh
Jordan University of Science and Technology, Irbid, Jordan
Emails: {hjalitamnah10, eaabukeshek11, mbbanyissa10}@cit.just.edu.jo, {maalshbool, yijararweh}@just.edu.jo

Abstract—Text classification (TC) is one of the fundamental problems in text mining. Plenty of works exist on TC with interesting approaches and excellent results; however, most of these works follow a word-based approach for feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for languages such as English and Portuguese, and it has been shown to have certain advantages/disadvantages compared with word-based approaches. This work applies CTC to the Arabic language with the purpose of investigating whether these advantages/disadvantages exist for the Arabic language as well. The results are encouraging, as they show the viability of using CTC for Arabic TC.

I. INTRODUCTION

The text classification (TC) problem is concerned with automatically placing text documents in categories/classes based on their contents. It is one of the fundamental problems in many fields such as text mining, machine learning, natural language processing, and information retrieval, with a vast range of applications such as spam filtering [1], sentiment analysis [2], [3], [4], [5], and determining an author's characteristics such as identity [6], [7], [8], gender [9], [10], dialect [11], [12], native language [13], and political orientation [14], [15]. The TC problem has gained more importance due to the explosion in the size of text data available on the Web over the past two decades. Not only has this expansion forced people to consider scalability issues (giving rise to important fields such as Big Data), it has also produced special challenges for the TC problem.
In general TC, we are given a large-enough dataset of manually labeled training documents, and the objective is to build a classification model based on this dataset capable of accurately predicting the class of an unlabeled document. While this description applies to all supervised learning problems, TC has some special characteristics requiring special attention. For example, most works on TC start by applying some text preprocessing tasks followed by employing a word-based approach for feature extraction. They may, for instance, tokenize the article and apply stemming followed by stop word removal. Then, they may use word occurrences in each article to build a feature vector for it in what is known as the bag-of-words (BOW) approach [16]. Such an approach relies heavily on word-based processing steps (which are language-dependent) such as tokenization, stemming, etc., and tends to ignore word order, contextual information, and other non-word features such as punctuation marks and features spanning more than one word. Moreover, it tends to generate feature vectors consisting of thousands of features even for relatively small and restricted datasets. Thus, a feature selection algorithm has to be applied to determine which features to keep based on their "discriminating" power. Finally, according to Frank et al. [17], it has to deal with issues like how to define a "word," what to do with numbers and other non-alphabetic strings, and whether to apply stemming. These issues give rise to alternative approaches to TC such as compression-based TC (CTC). According to Marton et al. [18], CTC has been heavily studied to explore its advantages/disadvantages compared with traditional word-based TC approaches. Examples of the advantages include the ease of application, the lack of dependence on the often heavy text preprocessing steps, the ability to capture non-word features, etc.
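The word-based pipeline described above (tokenization, stemming, stop word removal, word counts) can be summarized with a minimal bag-of-words sketch. The tokenizer, the tiny stop word list, and the example document below are illustrative assumptions for English text, not the specific tools evaluated in this paper; stemming is omitted for brevity.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (hypothetical; real lists are much larger).
STOP_WORDS = {"the", "a", "of", "in", "and"}

def bow_vector(text):
    """Build a sparse bag-of-words vector: word -> occurrence count."""
    tokens = re.findall(r"[a-z]+", text.lower())       # crude tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return Counter(tokens)

doc = "The cat sat in the garden and the cat slept."
vec = bow_vector(doc)
print(vec["cat"])  # 2
```

Note how even this toy vector silently discards word order and punctuation, which is exactly the information a compression-based approach can retain.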
On the other hand, the disadvantages include the poor performance of some CTC approaches in terms of accuracy and computational complexity. Most such works target the English language. The objective of this work is to explore these advantages/disadvantages for the Arabic language. To the best of our knowledge, there has been no previous work on CTC of Arabic documents, despite the fact that CTC has been heavily studied for English documents over the past three or four decades [17]. This work is especially important to draw attention to CTC as an alternative option for performing various TC tasks such as spam filtering, authorship authentication, language/dialect identification, etc. Taking into consideration the significance of TC and the reliance of traditional TC algorithms on language-dependent tools which do not perform on Arabic text as well as they perform on English text, one can see the need to explore language-independent options.

The language of choice in this work is Arabic. Most existing works on NLP in general consider English text, from text processing tools to optimized classifiers. Arabic, on the other hand, is largely understudied despite being one of the six official languages of the UN and the native language of 420 million people living in the Arab world, which spans regions of the Middle East and North Africa (MENA) in addition to parts of East Africa (Horn of Africa) [19]. Moreover, the amount of Arabic content on the Web and the number of Arabic-speaking users are growing rapidly [5], [20]. Finally, Arabic is a morphologically rich language with many challenging aspects. The importance of the Arabic language and the interesting challenges associated with studying it make it one of the most appealing languages to study.

The rest of this paper is organized as follows. In Section II,

978-1-4799-7100-8/14/$31.00 ©2014 IEEE
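To make the compression-based alternative concrete, the following is a minimal sketch of one common CTC scheme: a test document is assigned to the class whose training text, when concatenated with the document, gains the fewest extra compressed bytes. The use of zlib, the toy corpora, and this particular scoring rule are illustrative assumptions; they are not the specific compressors or datasets evaluated in this paper. Note that the scheme needs no tokenization or stemming, which is what makes it language-independent.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Compressed length in bytes at maximum compression level."""
    return len(zlib.compress(data, 9))

def classify(doc: str, class_corpora: dict) -> str:
    """Assign doc to the class whose corpus 'explains' it best,
    i.e., whose concatenation with doc adds the fewest compressed bytes."""
    doc_b = doc.encode("utf-8")
    def extra_bytes(corpus: str) -> int:
        corpus_b = corpus.encode("utf-8")
        return compressed_size(corpus_b + doc_b) - compressed_size(corpus_b)
    return min(class_corpora, key=lambda c: extra_bytes(class_corpora[c]))

# Toy training corpora (hypothetical, for illustration only).
corpora = {
    "sports": "goal match team player score win league " * 20,
    "finance": "market stock price bank trade profit fund " * 20,
}
print(classify("the player scored a goal in the match", corpora))  # sports
```

Since zlib operates on raw bytes, the same code applies unchanged to Arabic (UTF-8) text; compressors with stronger context modeling, such as PPM-based ones, are typically preferred in the CTC literature.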