Abstract—Text categorization is the problem of classifying text
documents into a set of predefined classes. After a preprocessing
step, the documents are typically represented as large sparse vectors.
When training classifiers on large collections of documents, both the
time and memory restrictions can be quite prohibitive. This justifies
the application of feature selection methods to reduce the
dimensionality of the document-representation vector. In this paper,
three feature selection methods are evaluated: Random Selection,
Information Gain (IG) and Support Vector Machine feature selection
(called SVM_FS). We show that the best results were obtained with
SVM_FS method for a relatively small dimension of the feature
vector. Also we present a novel method to better correlate SVM
kernel’s parameters (Polynomial or Gaussian kernel).
Keywords—Feature Selection, Learning with Kernels, Support
Vector Machine, and Classification.
I. INTRODUCTION
While more and more textual information is available
online, effective retrieval is difficult without good
indexing and summarization of document content. Document
categorization is one solution to this problem. In recent years
a growing number of categorization methods and machine
learning techniques have been developed and applied in
different contexts.
Documents are typically represented as vectors in a features
space. Each word in the vocabulary is represented as a
separate dimension. The number of occurrences of a word in a
document represents the value of the corresponding
component in the document’s vector. This document
representation results in a huge dimensionality of the feature
space, which poses a major problem to text categorization.
The native feature space consists of the unique terms that
occur in the documents, which can be tens or hundreds of
thousands of terms for even a moderate-sized text collection.
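As an illustration, the bag-of-words construction described above can be sketched as follows. This is a toy example in pure Python; for the tens or hundreds of thousands of dimensions mentioned, real systems use sparse data structures.

```python
from collections import Counter

# Toy corpus; a real collection has thousands of documents.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# Vocabulary: each unique term becomes one dimension of the feature space.
vocab = sorted({term for doc in docs for term in doc.split()})

def to_vector(doc):
    """Map a document to its term-frequency vector over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

vectors = [to_vector(doc) for doc in docs]
```

Note how most components are zero even in this toy example, which is why the document vectors are described as large and sparse.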
Due to the large dimensionality, much time and memory are
needed for training a classifier on a large collection of
documents. For this reason we explore various methods to
Manuscript received June 22, 2006.
D. Morariu is with the Faculty of Engineering, “Lucian Blaga” University
of Sibiu, Computer Science Department, E. Cioran Street, No. 4, 550025
Sibiu, Romania, (phone: 40/0740/092202; e-mail:
daniel.morariu@ulbsibiu.ro).
L. Vintan is with the Faculty of Engineering, “Lucian Blaga” University of
Sibiu, Computer Science Department, E. Cioran Street, No. 4, 550025 Sibiu,
Romania, (e-mail: lucian.vintan@ulbsibiu.ro).
V. Tresp is with the Siemens AG, Information and Comunications, 81739
Munchen, Germany (e-mail: volker.tresp@siemens.com).
reduce the feature space and thus the response time. As we’ll
show the categorization results are better when we work with
a smaller optimized dimension of the feature space. As the
feature space grows, the accuracy of the classifier doesn’t
grow significantly; actually it even can decreases due to noisy
vector elements.
This paper presents a comparative study of feature
selection methods used prior to document classification
(Random Selection, Information Gain [4] and SVM feature
selection [5]). We also studied the influence of the input data
representation on classification accuracy, using three types of
representation: Binary, Nominal and Cornell SMART.
For the classification process we used the Support Vector
Machine technique, which has proven to be efficient for
nonlinearly separable input data [8], [9], [11].
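A minimal sketch of the three input data representations, assuming their commonly used definitions (binary presence, frequency normalized by the document's most frequent term, and logarithmically damped frequency for Cornell SMART); the exact normalization used in our experiments may differ in detail:

```python
import math

def binary(tf):
    # 1 if the term occurs in the document, 0 otherwise.
    return 1 if tf > 0 else 0

def nominal(tf, max_tf):
    # Term frequency normalized by the document's most frequent term.
    return tf / max_tf

def cornell_smart(tf):
    # Logarithmically damped term frequency; 0 for absent terms.
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```

All three map the same raw term counts into a bounded or slowly growing range, which reduces the influence of very frequent terms.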
The Support Vector Machine (SVM) is based on learning
with kernels. A great advantage of this technique is that it can
handle large input data and feature sets, which makes it easy
to test the influence of the number of features on classification
accuracy. We implemented SVM classification for two types
of kernels: the polynomial kernel and the Gaussian kernel
(Radial Basis Function, RBF). We use a simplified form of the
kernels obtained by correlating their parameters. We have also
modified this SVM representation so that it can be used as a
feature selection method in the text-mining step.
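The two kernels in their standard forms, k(x, y) = (x·y + b)^d and k(x, y) = exp(-||x - y||^2 / (2*sigma^2)), can be sketched as follows. The parameter values shown are purely illustrative; the specific correlation of the parameters that we use is a separate contribution and is not reproduced here.

```python
import math

def polynomial_kernel(x, y, d, b):
    # k(x, y) = (x . y + b) ^ d  -- degree d, bias b.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + b) ** d

def gaussian_kernel(x, y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))  -- RBF kernel.
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

Correlating the parameters (for example, tying the bias b to the degree d) reduces the kernel to a single free parameter, which simplifies the model selection step.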
Sections 2 and 3 contain prerequisites for the work that we
present in this paper. In section 4 we present the framework
and the methodology used for our experiments. Section 5
presents the main results of our experiments. The last section
discusses the most important results and proposes some
further work.
II. FEATURE SELECTION METHODS
A substantial fraction of the available information is stored
in text or document databases which consist of a large
collection of documents from various sources such as news
articles, research papers, books, web pages, etc. Data stored in
text format is considered semi-structured, meaning it is
neither completely unstructured nor completely structured. In
text categorization, feature selection is typically performed by
assigning a score or a weight to each term and keeping some
number of terms with the highest scores while discarding the
rest. After this, experiments evaluate the effects that feature
selection has on both the classification performance and the
response time.
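The score-and-keep-top-k scheme described above can be sketched as follows; the scores dictionary is purely illustrative and would in practice be produced by one of the scoring measures discussed in this section.

```python
def select_features(scores, k):
    """Keep the k terms with the highest scores, discard the rest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])

# Hypothetical per-term scores (e.g. from Information Gain).
scores = {"kernel": 0.9, "the": 0.1, "svm": 0.8, "of": 0.05}
top = select_features(scores, 2)  # keeps the two best-scoring terms
```

The remaining documents are then re-represented over only the selected terms, which shrinks the feature vectors before training.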
Numerous feature scoring measures have been proposed
Feature Selection Methods for an Improved SVM Classifier
Daniel Morariu, Lucian N. Vintan, and Volker Tresp
Transactions on Engineering, Computing and Technology, Volume 14, August 2006, ISSN 1305-5313, © 2006 World Enformatika Society