Structural feature selection for wrapper methods

Gianluca Bontempi
ULB Machine Learning Group
Université Libre de Bruxelles
1050 Brussels - Belgium
email: gbonte@ulb.ac.be

Abstract. The wrapper approach to feature selection requires the assessment of several subset alternatives and the selection of the one which is expected to have the lowest generalization error. To tackle this problem, practitioners often have recourse to a search procedure over a very large space of feature subsets, aiming to minimize a leave-one-out or, more generally, a cross-validation criterion. It has previously been discussed in the literature how this practice can lead to a strong selection bias in high-dimensionality problems. We propose here an alternative method, inspired by structural identification in model selection, which replaces a single global search by a number of searches into a sequence of nested spaces of features with an increasing number of variables. The paper presents some promising, although preliminary, results on several real nonlinear regression problems.

1 Introduction

Consider a multivariate supervised learning problem where n is the size of the input vector X = {x_1, ..., x_n}. In the case of a very large n, it is common practice in machine learning to adopt feature selection algorithms [2] to improve the generalization accuracy. A well-known example is the wrapper technique [4], where the feature subset selection algorithm acts as a wrapper around the learning algorithm, considered as a black box that assesses (e.g. via cross-validation) feature subsets. If we denote by S = 2^X the power set of X, the goal of a wrapper algorithm is to return a subset s ∈ S of features with low prediction error. In this paper we will focus on the expected value of the squared error, also known as mean integrated squared error (MISE), as the measure of the prediction error.
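As an illustration of the wrapper scheme just described (not part of the original paper), the following is a minimal sketch in Python. Since the power set S = 2^X is too large to enumerate, it uses a greedy forward search as the wrapper's search procedure, with a k-fold cross-validated mean squared error of a plain least-squares model as the black-box assessment; all function names and parameters here are illustrative choices, not the paper's method.

```python
import numpy as np

def cv_mse(X, y, features, k=5, seed=0):
    """k-fold cross-validated MSE of a least-squares fit on a feature subset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # add an intercept column to the selected features
        Xtr = np.column_stack([np.ones(len(train)), X[train][:, features]])
        Xte = np.column_stack([np.ones(len(test)), X[test][:, features]])
        w, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.append(np.mean((Xte @ w - y[test]) ** 2))
    return float(np.mean(errs))

def forward_wrapper(X, y, max_features):
    """Greedy forward wrapper: repeatedly add the feature that most
    reduces the cross-validation error; stop when no feature helps."""
    selected, remaining = [], list(range(X.shape[1]))
    best_err = np.inf
    while remaining and len(selected) < max_features:
        scores = {f: cv_mse(X, y, selected + [f]) for f in remaining}
        f_best = min(scores, key=scores.get)
        if scores[f_best] >= best_err:
            break  # adding any further feature does not improve the CV error
        best_err = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_err
```

On synthetic data where y depends on only two of many input variables, such a search typically recovers the relevant pair; note, however, that the returned cross-validation score of the winning subset is exactly the optimistically biased quantity discussed in the remainder of the paper.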
Since the MISE is not directly measurable but can only be estimated, the feature selection problem may be formulated in terms of a stochastic optimization problem [3], where the selection of the best subset s has to be based on a sample estimate of the MISE. Consider the stochastic minimization of the positive function

g(s) = E[G(s)], s ∈ S,

that is, the expected value of a random function G(s) > 0. Let G(s) denote a realization of this random function and

ŝ = arg min_{s ∈ S} G(s)    (1)

In general terms, coupling the estimation of an expected value function g(s) with the optimization of the function itself should be tackled very cautiously because

ESANN'2005 proceedings - European Symposium on Artificial Neural Networks, Bruges (Belgium), 27-29 April 2005, d-side publi., ISBN 2-930307-05-6.
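The danger of coupling estimation and optimization can be illustrated with a small numerical sketch (an illustration added here, not from the paper). Suppose every candidate subset has the same true error g(s) = 1.0, and each G(s) is a noisy estimate such as a cross-validation score; taking the minimum over many subsets, the score of the winner systematically underestimates its true error. The noise level and number of subsets below are arbitrary assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_subsets, n_trials = 1000, 200
true_g = 1.0          # every subset has the same true generalization error
noise_sd = 0.2        # spread of the estimate G(s) around g(s)

apparent, actual = [], []
for _ in range(n_trials):
    # one realization G(s) per candidate subset
    G = true_g + noise_sd * rng.normal(size=n_subsets)
    s_hat = int(np.argmin(G))
    apparent.append(G[s_hat])  # score used to pick s_hat: biased low
    actual.append(true_g)      # true error of s_hat is unchanged

print(np.mean(apparent), np.mean(actual))
```

The average "apparent" error of the selected subset falls well below 1.0 while its true error remains exactly 1.0: minimizing a realization G(s) over a large search space makes min_s G(s) an optimistic estimator of g(ŝ), which is precisely the selection bias the structural approach aims to control.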