Proceedings of the 12th International Workshop on Computer Science and Information Technologies CSIT’2010, Moscow – Saint-Petersburg, Russia, 2010

Arabic Text Classification Using Decision Trees

Motaz K. Saad
Computer Engineering Department
Islamic University of Gaza
Gaza, Palestine
e-mail: msaad@iugaza.edu.ps

Wesam Ashour
Computer Engineering Department
Islamic University of Gaza
Gaza, Palestine
e-mail: washour@iugaza.edu.ps

Abstract

Text mining has drawn increasing attention recently and has been applied in different domains, including web mining, opinion mining, and sentiment analysis. Text pre-processing is an important stage in text mining. The major obstacle in text mining is the very high dimensionality and the large size of text data. Natural language processing and morphological tools can be employed to reduce the dimensionality and size of text data. In addition, many term weighting schemes are available in the literature that may be used to enhance the representation of text as a feature vector. In this paper, we study the impact of text pre-processing and different term weighting schemes on Arabic text classification. In addition, we develop new combinations of term weighting schemes to be applied to Arabic text for classification purposes.

1. Introduction

Text mining is a vital process because of the huge amount of information available in text documents, which exist in various formats. However, making text understandable to machines at a human level is not a trivial task. The process includes deriving linguistic features from the text so that it can be mined with human-like interpretation, particularly for the Arabic language. Text mining is well motivated by the fact that much of the world’s data can be found in text form (newspaper articles, emails, literature, web pages, etc.). Text mining has the same goals as data mining, including text categorization, clustering, document summarization, and the extraction of useful knowledge and trends. Text mining must overcome a major difficulty: text has no explicit structure [4].
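As a rough illustration of how unstructured text can be given an explicit structure through term weighting, the following minimal sketch computes a basic TF-IDF feature vector for each document. It assumes whitespace tokenization, the plain tf * log(N / df) formula, and a toy English corpus; the function name `tf_idf` is hypothetical, and this is one standard scheme rather than the specific combinations studied in this paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    # Assumption: plain whitespace tokenization, no stemming or stop-word removal.
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Relative term frequency scaled by inverse document frequency.
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy corpus (illustrative only, not data from the paper).
docs = ["market economy market", "football match", "market match"]
vectors = tf_idf(docs)
for doc, vector in zip(docs, vectors):
    print(doc, vector)
```

Under this formula a term that occurs in every document receives a weight of zero, so any pre-processing step that merges word forms changes the document frequencies and therefore the resulting vectors.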
Machines can reason about relational data well since schemas are explicitly available. Text, however, encodes all semantic information within natural language. Text mining algorithms, then, must make some sense out of this natural language representation. Humans are good at doing this, but it has proved to be a problem for machines [4].

Text mining usually involves structuring the input text (parsing, along with the addition of some derived linguistic features and the removal of others), deriving patterns within the structured data, and finally evaluating and interpreting the output. High quality in text mining usually refers to some combination of relevance, novelty, and interestingness.

Arabic is one of the most widely used languages in the world. It is spoken by more than 280 million people as a first language and by 250 million as a second language. Although Arabic is widely spoken, there are relatively few studies on the retrieval and mining of Arabic text documents in the literature. This is due to the unique nature of the morphological principles of the Arabic language. Arabic is a challenging language for a number of reasons [1, 2, 3, 5, 12]:

1. Orthography with diacritics is less ambiguous and more phonetic in Arabic, but certain combinations of characters can be written in different ways.
2. Arabic has a very complex morphology compared to English.
3. Broken plurals are common. Broken plurals are somewhat like irregular English plurals, except that they often do not resemble the singular form as closely as irregular plurals resemble the singular in English. Because broken plurals do not obey normal morphological rules, they are not handled by existing stemmers.
4. Arabic has short vowels that change pronunciation. They are grammatically required but are omitted in written Arabic texts.
5. Arabic synonyms are widespread.

The impact of text pre-processing and of different combinations of term weighting schemes on Arabic text classification has not been studied in the literature. In this