Feature Subset Selection for Arabic Document Categorization using BPSO-KNN

Hamouda K. Chantar
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
hamoudak77@yahoo.com

David W. Corne
School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
dwcorne@gmail.com

Abstract—Document categorization is an important topic that is central to many applications that demand reasoning about and organisation of text documents, web pages, and so forth. Document classification is commonly achieved by choosing appropriate features (terms) and building a term-frequency inverse-document-frequency (TFIDF) feature vector. In this process, feature selection is a key factor in the accuracy and effectiveness of resulting classifications. For a given task, the right choice of features means accurate classification with suitable levels of computational efficiency. Meanwhile, most document classification work is based on English-language documents. In this paper we make three main contributions: (i) we demonstrate successful document classification in the context of Arabic documents (although previous work has demonstrated text classification in Arabic, the datasets used, and the experimental setup, have not been revealed); (ii) we offer our datasets to enable other researchers to compare directly with our results; (iii) we demonstrate a combination of Binary PSO and K-nearest-neighbour that performs well in selecting good sets of features for this task.

Keywords—feature selection, text mining, Arabic language processing

I. INTRODUCTION

With rapid growth in the availability and use of natural language text documents in electronic form, automatic text classification becomes an important technique for understanding and organizing these data. Text categorization or 'topic spotting' is the task of classifying (largely unstructured) natural language documents into one or more pre-defined categories based on their content.
The ability to do this supports an increasing number of applications, including more informative search engine interfaces, and replacing very time-consuming human effort in the manual organization of large collections of text documents.

The basis of document/text processing is to transform a document into a term-frequency vector [1], but this immediately raises the question of which terms, and how many terms, to use to represent a document. This general question of feature selection (FS) has a great impact on data mining in general and text mining in particular. FS has been an active research area since the 1970s. In text classification in particular, feature selection aims to improve classification accuracy and computational efficiency by removing irrelevant and redundant terms (features), while retaining features that contain sufficient information to assist with the classification task at hand.

There are broadly two approaches to FS: the wrapper and filter approaches [2]. In the wrapper approach, typically a search is performed for an ideal subset of features, using the accuracy of classifiers (given those features) as a guide to evaluating an individual feature subset. In the filter approach, a subset of features is selected using a priori feature-scoring metrics; in the text categorization field, for example, features are ranked and selected using metrics such as document frequency, information gain, mutual information and so forth [1,2]. Generally, the wrapper approach is beneficial since it considers how well a group of features works together, and thus can implicitly detect and exploit nonlinear interactions among large subsets of features; however, wrapper approaches are relatively slow. Meanwhile, filter approaches risk missing such interactions between two or more features, and may discard features that are highly relevant to the classification task.
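To make the term-frequency vector representation concrete, the following is a minimal sketch of TF-IDF weighting over tokenized documents. The function name and the raw-count TF / log(N/df) IDF variant are illustrative choices, not the paper's specific formulation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents.

    TF is the raw term count within a document; IDF is
    log(N / df(t)), where df(t) is the number of documents
    containing term t and N is the number of documents.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors

docs = [
    ["sport", "match", "goal"],
    ["market", "stock", "goal"],     # "goal" occurs in both documents
]
vecs = tfidf_vectors(docs)
# a term occurring in every document gets IDF = log(1) = 0,
# so it contributes nothing to the feature vector
```

Note that every distinct term contributes a dimension, which is why feature selection matters: real corpora yield tens of thousands of dimensions, most of them uninformative.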
In this paper we choose a wrapper approach, since we are mostly interested in developing accurate classifiers (e.g. to support a tool that post-processes the results from an Arabic search engine), and in that context it is not critical that the time spent developing the tool be short.

Finally, we note some basic differences between Arabic and English. Arabic has 28 letters and is written from right to left. In contrast with English, Arabic has a richer morphology that makes developing automatic processing systems for it a highly challenging task [3]. The basic nature of the language, in the context of text classification, is similar to English in that we can hope to rely on the frequency distributions of 'content terms' to underpin the development of automatic text categorisation. However, the high degree of inflection, word gender, and plurality (Arabic has forms for singular, dual, and plural) means the pre-processing (e.g. stemming) stage is more complex than in the English case.

The remainder of this paper is set out as follows. In section II we briefly overview related work on Arabic text categorization. This essentially provides a list of indicative performance values (in terms of accuracy or F1-measure) for such work, and points towards the more promising approaches,
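As an illustration of the Arabic pre-processing challenge mentioned above, the sketch below shows the idea behind light stemming: stripping common conjunction/article prefixes and inflectional suffixes. The affix lists here are a small illustrative sample, not the full set used in published Arabic light stemmers, and the length guard is a simplification:

```python
# Illustrative Arabic light stemmer (sketch, not the paper's method).
# Strips one prefix and one suffix from small example affix lists,
# keeping at least two letters of the stem.
PREFIXES = ["وال", "بال", "ال", "و"]   # e.g. "and-the", "with-the", "the", "and"
SUFFIXES = ["ات", "ون", "ين", "ة"]     # common plural/feminine endings

def light_stem(word):
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[: len(word) - len(s)]
            break
    return word

# "الكتاب" (the book) -> "كتاب" (book): definite article "ال" removed
# "والمعلمون" (and the teachers) -> "معلم": prefix "وال" and plural "ون" removed
```

Even this simple scheme conflates distinct words and misses broken (irregular) plurals, which is why Arabic stemming is considerably harder than English suffix stripping.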