IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Multiclass Feature Selection With Kernel Gram-Matrix-Based Criteria

Mathieu Ramona, Member, IEEE, Gaël Richard, Senior Member, IEEE, and Bertrand David, Member, IEEE

Abstract— Feature selection has been an important issue in recent decades to determine the most relevant features according to a given classification problem. Numerous methods have emerged that take into account support vector machines (SVMs) in the selection process. Such approaches are powerful but often complex and costly. In this paper, we propose new feature selection methods based on two criteria designed for the optimization of SVM: kernel target alignment and kernel class separability. We demonstrate how these two measures, when fully expressed, can build efficient and simple methods, easily applicable to multiclass problems and iteratively computable with minimal memory requirements. An extensive experimental study is conducted both on artificial and real-world datasets to compare the proposed methods to state-of-the-art feature selection algorithms. The results demonstrate the relevance of the proposed methods both in terms of performance and computational cost.

Index Terms— Audio classification, feature selection, kernel class separability, kernel target alignment (KTA), support vector machines (SVMs), variable selection.

NOTATIONS

1_k — the k × k matrix filled with ones.
A ⊙ B — entry-wise product of two matrices or vectors.
⟨A, B⟩_F — Frobenius inner product of two matrices; ||A||_F the corresponding norm.
Σ(A) — sum of all entries of matrix A (Σ(A) = Σ_{i,j} a_{ij}).
∂_θ x — partial derivative of x with regard to θ (i.e., ∂x/∂θ).
w_h — the hyperplane normal vectors.
w — the scale factors of a scaled kernel k_w.

I. INTRODUCTION
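As a quick illustration of the matrix notations above, the following sketch (pure Python, with hypothetical helper names not taken from the paper) computes the entry-wise product A ⊙ B, the Frobenius inner product ⟨A, B⟩_F, and the entry sum Σ(A) on two small matrices.

```python
# Illustrative implementations of the matrix notations used in the paper,
# on plain nested lists (no external dependencies). Helper names are ours.

def hadamard(A, B):
    """Entry-wise (Hadamard) product A ⊙ B."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def frobenius_inner(A, B):
    """Frobenius inner product <A, B>_F = sum_{i,j} a_ij * b_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def entry_sum(A):
    """Sigma(A): sum of all entries of A."""
    return sum(sum(row) for row in A)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

print(hadamard(A, B))         # [[5, 12], [21, 32]]
print(frobenius_inner(A, B))  # 70
print(entry_sum(A))           # 10
# The Frobenius norm ||A||_F is frobenius_inner(A, A) ** 0.5.
```

Note that Σ(A ⊙ B) equals ⟨A, B⟩_F, which is why criteria expressed with these operators can be accumulated entry by entry over the Gram matrix.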
I NTRODUCTION I N THE context of supervised pattern recognition, the gathering of large datasets has become a common process with the availability of more sensors and the increase of computational resources. But the accumulation of data is not necessarily profitable for pattern recognition systems, Manuscript received November 11, 2011; revised May 16, 2012; accepted May 18, 2012. M. Ramona was with the Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris F-75634, France. He is now with the Sound Analysis and Synthesis Team, IRCAM/CNRS-STMS, Paris 75004, France (e-mail: mathieu.ramona@ircam.fr). G. Richard and B. David are with the Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris F-75634, France (e-mail: gael.richard@telecom-paristech.fr; bertrand.david@telecom-paristech.fr). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNNLS.2012.2201748 which generally face the so-called curse of dimensionality (explained in [1] by the fact that a high-dimensional space, populated by a finite set, is nearly empty). Sparseness of the training set results in the classifier’s overfitting and thus penalizes generalization. Moreover, large collections of features generally contain highly correlated descriptors derived from the same sources, or irrelevant ones, feeding the learning process with unreliable information. Feature selection aims at determining the most relevant features according to a given problem. Dimension reduction and the removal of irrelevant features are meant to enhance generalization performances but also allow some insights into the problem through the interpretation of the most relevant features. This also yields an important cost reduction both in storage need and computational speed. 
According to [2] and [3], feature selection methods divide into filters, built as preprocessing steps of the classification and thus independent of the classifier, and wrappers, which use the classifier as a black box to operate the feature selection. However, even filter selection is related to the classifier, as the selection criterion is always based on an assumption about the classification process. Linear discriminant analysis, for instance, aims at determining an optimal hyperplane separating both classes' examples, but the choice of the optimality criterion implies underlying assumptions. Support vector machines (SVMs) rely on the distance between the separating hyperplane and the closest examples, the so-called margin. The problem, widely explored [4], [5], is solved through quadratic programming. The "kernel trick" further introduces nonlinearity by substituting a kernel function k(x, y) = ⟨φ(x), φ(y)⟩ for inner products (where φ can remain implicit), under some restrictions on the choice of k. The target space of φ is generally called the feature space, and has a much higher dimension (possibly infinite) than the original input space. This transformation widens the range and complexity of possible decision surfaces in the input space.

Several methods address the problem of taking the SVM underlying the process into account in the feature selection step, among which the radius-margin bound [6] shows very good results in practice. Nevertheless, they often involve multiple SVM trainings, and even other optimization processes, as part of the feature selection, and are thus computationally expensive. Moreover, some are not designed to scale up to very large datasets. We propose here three new feature selection methods based on the kernel target alignment (KTA) and kernel class separability (KCS) criteria, which are evaluated iteratively from the sole Gram matrix values and are thus simple and very scalable in terms of memory.
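To make the Gram-matrix-based flavor of these criteria concrete, here is a minimal sketch of the standard kernel target alignment for a binary problem with labels in {−1, +1}: the alignment ⟨K, yyᵀ⟩_F / (||K||_F ||yyᵀ||_F), computed from Gram-matrix entries alone. This follows the usual KTA definition rather than the paper's specific multiclass formulation, and the helper names are illustrative.

```python
# Minimal sketch of kernel target alignment (KTA) from Gram-matrix entries.
# Assumes binary labels in {-1, +1}; names are ours, not from the paper.
import math

def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))

def gram_matrix(X, kernel):
    return [[kernel(xi, xj) for xj in X] for xi in X]

def kta(K, labels):
    """Alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F)."""
    n = len(labels)
    num = sum(K[i][j] * labels[i] * labels[j]
              for i in range(n) for j in range(n))
    k_norm = math.sqrt(sum(K[i][j] ** 2
                           for i in range(n) for j in range(n)))
    # For labels in {-1, +1}, ||yy^T||_F = n.
    return num / (k_norm * n)

# Two well-separated classes along the first coordinate:
X = [(2.0, 0.1), (1.8, -0.2), (-2.1, 0.0), (-1.9, 0.3)]
y = [1, 1, -1, -1]
K = gram_matrix(X, linear_kernel)
print(kta(K, y))  # close to 1 when the kernel matches the labels
```

Since the criterion is a plain sum over Gram-matrix entries, it can be accumulated iteratively without storing anything beyond K itself, which is the scalability property the proposed methods exploit.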