Abstract—Text categorization is the problem of classifying text documents into a set of predefined classes. After a preprocessing step, the documents are typically represented as large sparse vectors. When training classifiers on large collections of documents, both time and memory constraints can be quite prohibitive. This justifies the application of feature selection methods to reduce the dimensionality of the document-representation vectors. In this paper, three feature selection methods are evaluated: Random Selection, Information Gain (IG), and Support Vector Machine feature selection (called SVM_FS). We show that the best results were obtained with the SVM_FS method for a relatively small dimension of the feature vector. We also present a novel method to better correlate the SVM kernel's parameters (polynomial or Gaussian kernel).

Keywords—Feature Selection, Learning with Kernels, Support Vector Machine, Classification.

I. INTRODUCTION

While more and more textual information is available online, effective retrieval is difficult without good indexing and summarization of document content. Document categorization is one solution to this problem. In recent years a growing number of categorization methods and machine learning techniques have been developed and applied in different contexts.

Documents are typically represented as vectors in a feature space. Each word in the vocabulary corresponds to a separate dimension, and the number of occurrences of a word in a document gives the value of the corresponding component of the document's vector. This representation results in a huge dimensionality of the feature space, which poses a major problem for text categorization: the native feature space consists of the unique terms that occur in the documents, which can amount to tens or hundreds of thousands of terms even for a moderate-sized text collection.
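The bag-of-words representation described above can be sketched as follows. The paper does not specify its tokenization or preprocessing details, so this minimal illustration assumes simple whitespace tokenization; only nonzero components are stored, reflecting the sparsity of the vectors.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each unique term in the corpus to a dimension index."""
    vocab = {}
    for doc in documents:
        for term in doc.split():
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab

def to_sparse_vector(doc, vocab):
    """Represent a document as {dimension index: term count},
    storing only the nonzero components."""
    counts = Counter(doc.split())
    return {vocab[t]: c for t, c in counts.items() if t in vocab}

docs = ["the cat sat on the mat", "the dog barked"]
vocab = build_vocabulary(docs)          # 7 unique terms -> 7 dimensions
vec = to_sparse_vector(docs[0], vocab)  # e.g. "the" occurs twice
```

On a realistic corpus the vocabulary grows to the tens or hundreds of thousands of dimensions mentioned in the text, which is what motivates the feature selection methods studied in the paper.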
Due to the large dimensionality, much time and memory are needed to train a classifier on a large collection of documents. For this reason we explore various methods to reduce the feature space and thus the response time. As we will show, categorization results are better when we work with a smaller, optimized dimension of the feature space. As the feature space grows, the accuracy of the classifier does not grow significantly; it can even decrease due to noisy vector elements.

This paper presents a comparative study of feature selection methods applied prior to document classification (Random Selection, Information Gain [4], and SVM feature selection [5]). We also studied the influence of the input data representation on classification accuracy, using three types of representation: Binary, Nominal, and Cornell SMART. For the classification process we used the Support Vector Machine technique, which has proven to be efficient for nonlinearly separable input data [8], [9], [11]. The Support Vector Machine (SVM) is based on learning with kernels. A great advantage of this technique is that it can handle large input data and feature sets, which makes it easy to test the influence of the number of features on classification accuracy. We implemented SVM classification for two types of kernels: the polynomial kernel and the Gaussian kernel (Radial Basis Function, RBF).

Manuscript received June 22, 2006. D. Morariu is with the Faculty of Engineering, "Lucian Blaga" University of Sibiu, Computer Science Department, E. Cioran Street, No. 4, 550025 Sibiu, Romania (phone: 40/0740/092202; e-mail: daniel.morariu@ulbsibiu.ro). L. Vintan is with the Faculty of Engineering, "Lucian Blaga" University of Sibiu, Computer Science Department, E. Cioran Street, No. 4, 550025 Sibiu, Romania (e-mail: lucian.vintan@ulbsibiu.ro). V. Tresp is with Siemens AG, Information and Communications, 81739 München, Germany (e-mail: volker.tresp@siemens.com).
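The three input representations named above can be illustrated with their standard definitions. Note these are the commonly used formulas for Binary, Nominal, and Cornell SMART term weighting, reconstructed here as an assumption since the exact formulas are not reproduced in this excerpt; `n_dt` denotes the number of occurrences of term t in document d.

```python
import math

def binary_weight(n_dt):
    # Binary: 1 if the term occurs in the document, 0 otherwise.
    return 1.0 if n_dt > 0 else 0.0

def nominal_weight(n_dt, max_n_d):
    # Nominal: term frequency normalized by the count of the
    # most frequent term in the same document.
    return n_dt / max_n_d if max_n_d > 0 else 0.0

def cornell_smart_weight(n_dt):
    # Cornell SMART: dampened (double-logarithmic) term frequency,
    # 0 for terms absent from the document.
    return 0.0 if n_dt == 0 else 1.0 + math.log(1.0 + math.log(n_dt))
```

The dampened Cornell SMART weighting keeps very frequent terms from dominating the vector, which is one reason weighting choice can affect classification accuracy.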
We will use a simplified form of the kernels obtained by correlating their parameters. We have also modified this SVM representation so that it can be used as a feature selection method in the text-mining step. Sections 2 and 3 contain prerequisites for the work presented in this paper. In Section 4 we present the framework and the methodology used for our experiments. Section 5 presents the main results of our experiments. The last section discusses the most important results and proposes some further work.

II. FEATURE SELECTION METHODS

A substantial fraction of the available information is stored in text or document databases, which consist of large collections of documents from various sources such as news articles, research papers, books, and web pages. Data stored in text format is considered semi-structured, meaning it is neither completely unstructured nor completely structured. In text categorization, feature selection is typically performed by assigning a score or weight to each term and keeping some number of terms with the highest scores while discarding the rest. Experiments then evaluate the effects that feature selection has on both classification performance and response time. Numerous feature scoring measures have been proposed.

Daniel Morariu, Lucian N. Vintan, and Volker Tresp, "Feature Selection Methods for an Improved SVM Classifier", Transactions on Engineering, Computing and Technology, Volume 14, August 2006, ISSN 1305-5313.
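The score-and-keep-top-k procedure described in Section II can be sketched for the Information Gain measure [4]. This is a minimal illustration using the standard definition of IG for a term (class entropy minus conditional entropy given the term's presence or absence), not the authors' implementation; the per-class document counts are hypothetical.

```python
import math

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(docs_with_term, docs_without_term):
    """IG of a term: overall class entropy minus the entropy
    conditioned on the term's presence/absence. Each argument is
    a list of per-class document counts."""
    n_with, n_without = sum(docs_with_term), sum(docs_without_term)
    total = n_with + n_without
    overall = [w + wo for w, wo in zip(docs_with_term, docs_without_term)]
    return (entropy(overall)
            - (n_with / total) * entropy(docs_with_term)
            - (n_without / total) * entropy(docs_without_term))

def select_top_k(term_scores, k):
    """Keep the k terms with the highest scores, discard the rest."""
    return sorted(term_scores, key=term_scores.get, reverse=True)[:k]
```

A term that perfectly separates two balanced classes scores IG = 1 bit, while a term distributed identically across classes scores 0, so ranking by IG and keeping the top k discards the least discriminative dimensions first.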