Pattern Recognition 41 (2008) 3706--3719
Feature selection using localized generalization error for supervised classification
problems using RBFNN
Wing W.Y. Ng a,b,∗, Daniel S. Yeung a,b, Michael Firth c, Eric C.C. Tsang b, Xi-Zhao Wang d
a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
b Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
c Department of Finance and Insurance, Lingnan University, Hong Kong
d Machine Learning Center, Faculty of Mathematics and Computer Science, Hebei University, Baoding 071002, China
ARTICLE INFO

Article history:
Received 1 March 2007
Received in revised form 1 March 2008
Accepted 5 May 2008

Keywords:
Feature selection
Neural network
Generalization error
RBFNN

ABSTRACT
A pattern classification problem usually involves using high-dimensional features that make the classifier
very complex and difficult to train. With no feature reduction, both training accuracy and generalization
capability will suffer. This paper proposes a novel hybrid filter–wrapper-type feature subset selection
methodology using a localized generalization error model. The localized generalization error model for
a radial basis function neural network bounds from above the generalization error for unseen samples
located within a neighborhood of the training samples. Iteratively, the feature making the smallest con-
tribution to the generalization error bound is removed. Moreover, the novel feature selection method
is independent of the sample size and is computationally fast. The experimental results show that the
proposed method consistently removes large percentages of features with statistically insignificant loss of
testing accuracy for unseen samples. In the experiments for two of the datasets, the classifiers built using
feature subsets with 90% of features removed by our proposed approach yield average testing accuracies
higher than those trained using the full set of features. Finally, we corroborate the efficacy of the model
by using it to predict corporate bankruptcies in the US.
© 2008 Elsevier Ltd. All rights reserved.
1. Introduction
With the availability of fast computers, broadband Internet, and cheap, high-capacity storage, datasets have become ever larger. Usually, domain knowledge and personal bias influence the choice of input features. Although the chosen features may not fully describe the problem, some may be included simply for fear of losing something useful. When the number of input features in a dataset becomes large, the pattern classification systems trained to separate the samples into different classes also become more complex. Conversely, if fewer input features need to be collected, the cost of data collection and storage is reduced.
A major problem in pattern classification is how to build a sim-
ple classifier that has good performance. By “good performance” we
mean a system that can be quickly trained, is highly accurate and
responds quickly to future unseen samples, and is easily understood
∗ Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China.
E-mail addresses: wingng@ieee.org (W.W.Y. Ng), csdaniel@comp.polyu.edu.hk (D.S. Yeung), mafirth@ln.edu.hk (M. Firth), csetsang@comp.polyu.edu.hk (E.C.C. Tsang), wangxz@mail.hbu.edu.cn (X.-Z. Wang).
doi:10.1016/j.patcog.2008.05.004
by people. Perhaps the most straightforward way to reduce the com-
plexity of a classifier is to reduce the number of input features.
Given the training dataset $D = \{(x_b, F(x_b))\}_{b=1}^{N}$ consisting of $N$ training samples $x_b$, with $F$ denoting the unknown input–output mapping of the classification problem that one would like to approximate using a classifier (e.g. a neural network), the training error ($R_{emp}$) and the generalization error ($R_{true}$) over the entire input space ($T$) of the classifier $f_\theta$ are defined as

$$R_{emp} = \frac{1}{N} \sum_{b=1}^{N} (F(x_b) - f_\theta(x_b))^2 \quad (1)$$

$$R_{true} = \int_T (F(x) - f_\theta(x))^2 \, p(x) \, dx \quad (2)$$
where $p(x)$ denotes the true but unknown probability density function of $x$, and $\theta$ denotes the set of parameters of the classifier $f_\theta$. The ultimate goal of training a classifier is to minimize the generalization error for unseen samples, i.e. to minimize the difference between the real unknown input–output mapping and the mapping approximated by $f_\theta$. Moreover, the ultimate goal of feature selection is to maintain the classifier's generalization capability using a reduced