Pattern Recognition 41 (2008) 3706–3719

Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr

Feature selection using localized generalization error for supervised classification problems using RBFNN

Wing W.Y. Ng a,b,*, Daniel S. Yeung a,b, Michael Firth c, Eric C.C. Tsang b, Xi-Zhao Wang d

a School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China
b Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
c Department of Finance and Insurance, Lingnan University, Hong Kong
d Machine Learning Center, Faculty of Mathematics and Computer Science, Hebei University, Baoding 071002, China

ARTICLE INFO

Article history:
Received 1 March 2007
Received in revised form 1 March 2008
Accepted 5 May 2008

Keywords:
Feature selection
Neural network
Generalization error
RBFNN

ABSTRACT

A pattern classification problem usually involves using high-dimensional features that make the classifier very complex and difficult to train. With no feature reduction, both training accuracy and generalization capability will suffer. This paper proposes a novel hybrid filter–wrapper-type feature subset selection methodology using a localized generalization error model. The localized generalization error model for a radial basis function neural network bounds from above the generalization error for unseen samples located within a neighborhood of the training samples. Iteratively, the feature making the smallest contribution to the generalization error bound is removed. Moreover, the novel feature selection method is independent of the sample size and is computationally fast. The experimental results show that the proposed method consistently removes large percentages of features with statistically insignificant loss of testing accuracy for unseen samples.
In the experiments for two of the datasets, the classifiers built using feature subsets with 90% of the features removed by our proposed approach yield average testing accuracies higher than those trained using the full set of features. Finally, we corroborate the efficacy of the model by using it to predict corporate bankruptcies in the US.

© 2008 Elsevier Ltd. All rights reserved.

* Corresponding author at: School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China.
E-mail addresses: wingng@ieee.org (W.W.Y. Ng), csdaniel@comp.polyu.edu.hk (D.S. Yeung), mafirth@ln.edu.hk (M. Firth), csetsang@comp.polyu.edu.hk (E.C.C. Tsang), wangxz@mail.hbu.edu.cn (X.-Z. Wang).

0031-3203/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.patcog.2008.05.004

1. Introduction

With the availability of fast computers, broadband Internet, and cheap, high-capacity storage, datasets have become ever larger. Usually, domain knowledge and personal bias influence the choice of features. Although these parameters may not fully describe the problem, some parameters may be included simply for fear of losing something useful. When the number of parameters (input features) of the dataset becomes large, the pattern classification systems trained to differentiate the sample points into different classes also become more complex. On the other hand, if it is not necessary to collect so many input features, the cost of data collection and storage will be reduced.

A major problem in pattern classification is how to build a simple classifier that has good performance. By "good performance" we mean a system that can be quickly trained, is highly accurate, responds quickly to future unseen samples, and is easily understood by people. Perhaps the most straightforward way to reduce the complexity of a classifier is to reduce the number of input features.
Given the training dataset $D = \{(x_b, F(x_b))\}_{b=1}^{N}$ consisting of $N$ training samples ($x_b$), with $F$ denoting the unknown input–output mapping of the classification problem that one would like to approximate using a classifier (e.g. a neural network), the training error ($R_{emp}$) and the generalization error ($R_{true}$) over the entire input space ($T$) of the classifier $f$ are defined as

$$R_{emp} = \frac{1}{N} \sum_{b=1}^{N} (F(x_b) - f(x_b))^2 \tag{1}$$

$$R_{true} = \int_{T} (F(x) - f(x))^2\, p(x)\, dx \tag{2}$$

where $p(x)$ denotes the true unknown probability density function of $x$, and $\theta$ denotes the set of parameters in the classifier $f$. The ultimate goal of training a classifier is to minimize the generalization error for unseen samples (i.e. minimizing the differences between the real unknown input–output mapping function and the mapping approximated by $f$). Moreover, the ultimate goal of feature selection is to maintain the classifier's generalization capability using a reduced
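The distinction between Eqs. (1) and (2) can be illustrated numerically. The following sketch (our own toy example, not the paper's RBFNN) uses an assumed one-dimensional target mapping $F$ and an assumed approximation $f$: $R_{emp}$ is the average squared error over the $N$ training samples, while $R_{true}$, which requires the unknown density $p(x)$, is estimated here by Monte Carlo under the assumption that $p(x)$ is uniform on $T = [-1, 1]$.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    # Stand-in for the unknown true input-output mapping (an assumption
    # for this sketch; in practice F is never available in closed form).
    return np.sin(x)

def f(x):
    # Stand-in for the trained classifier's approximation of F
    # (here: a third-order Taylor expansion of sin).
    return x - x**3 / 6.0

# Training set D = {(x_b, F(x_b))}, b = 1..N, drawn from T = [-1, 1]
N = 100
x_train = rng.uniform(-1.0, 1.0, N)

# Eq. (1): R_emp = (1/N) * sum_b (F(x_b) - f(x_b))^2
R_emp = np.mean((F(x_train) - f(x_train)) ** 2)

# Eq. (2): R_true = integral over T of (F(x) - f(x))^2 p(x) dx,
# estimated by Monte Carlo with p(x) uniform on [-1, 1] (assumed).
x_unseen = rng.uniform(-1.0, 1.0, 100_000)
R_true_mc = np.mean((F(x_unseen) - f(x_unseen)) ** 2)

print(f"R_emp    = {R_emp:.3e}")
print(f"R_true   = {R_true_mc:.3e} (Monte Carlo estimate)")
```

Because $f$ is a close approximation of $F$ on $T$ in this toy setting, both quantities are small; in general $R_{emp}$ can be driven low by training while $R_{true}$ stays large, which is exactly the gap the generalization error bound addresses.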