IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
Multiclass Feature Selection With
Kernel Gram-Matrix-Based Criteria
Mathieu Ramona, Member, IEEE, Gaël Richard, Senior Member, IEEE , and Bertrand David, Member, IEEE
Abstract— Feature selection has been an important research topic in recent decades, aiming to determine the most relevant features for a given classification problem. Numerous methods have emerged that take support vector machines (SVMs) into account in the selection process. Such approaches are powerful but often complex and costly. In this paper, we propose new feature selection methods based on two criteria designed for the optimization of SVMs: kernel target alignment and kernel class separability. We demonstrate how these two measures, when fully expressed, can build efficient and simple methods, easily applicable to multiclass problems and iteratively computable with minimal memory requirements. An extensive experimental study is conducted on both artificial and real-world datasets to compare the proposed methods to state-of-the-art feature selection algorithms. The results demonstrate the relevance of the proposed methods, both in terms of performance and computational cost.
Index Terms— Audio classification, feature selection, kernel
class separability, kernel target alignment (KTA), support vector
machines (SVMs), variable selection.
NOTATIONS
1_k        Denotes a k × k unit matrix filled with 1.
A ◦ B      Entry-wise product of two matrices or vectors.
⟨A, B⟩_F   Frobenius inner product of two matrices.
‖A‖_F      The corresponding norm.
Σ(A)       Sum of all entries of matrix A (Σ(A) = ∑_{i,j} a_ij).
∂_θ x      Partial derivative of x with regard to θ (i.e., ∂x/∂θ).
w_h        Denotes the hyperplane normal vectors.
w          Denotes the scale factors of a scaled kernel k_w.
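As an illustration, the matrix notations above can be checked numerically (a minimal NumPy sketch; the variable names and example matrices are ours, not the paper's):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

ones_k = np.ones((2, 2))             # 1_k: k x k matrix filled with 1
hadamard = A * B                     # A ∘ B: entry-wise (Hadamard) product
frob_inner = np.sum(A * B)           # <A, B>_F: Frobenius inner product
frob_norm = np.sqrt(np.sum(A * A))   # ||A||_F: the corresponding norm
sigma_A = A.sum()                    # Σ(A): sum of all entries of A

# Equivalent closed forms, as a sanity check.
assert frob_inner == np.trace(A.T @ B)
assert np.isclose(frob_norm, np.linalg.norm(A, "fro"))
```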
Manuscript received November 11, 2011; revised May 16, 2012; accepted May 18, 2012.
M. Ramona was with the Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris F-75634, France. He is now with the Sound Analysis and Synthesis Team, IRCAM/CNRS-STMS, Paris 75004, France (e-mail: mathieu.ramona@ircam.fr).
G. Richard and B. David are with the Institut Mines-Telecom, Telecom ParisTech, CNRS LTCI, Paris F-75634, France (e-mail: gael.richard@telecom-paristech.fr; bertrand.david@telecom-paristech.fr).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2012.2201748

I. INTRODUCTION

IN THE context of supervised pattern recognition, the gathering of large datasets has become a common process with the availability of more sensors and the increase of computational resources. But the accumulation of data is not necessarily profitable for pattern recognition systems, which generally face the so-called curse of dimensionality (explained in [1] by the fact that a high-dimensional space, populated by a finite set, is nearly empty). Sparseness of the training set results in the classifier's overfitting and thus penalizes generalization. Moreover, large collections of features generally contain highly correlated descriptors derived from the same sources, or irrelevant ones, feeding the learning process with unreliable information.
Feature selection aims at determining the most relevant features for a given problem. Dimension reduction and the removal of irrelevant features are meant to enhance generalization performance, but they also allow some insight into the problem through the interpretation of the most relevant features. This additionally yields an important cost reduction, both in storage and in computation time.
According to [2] and [3], feature selection methods divide into filters, built as preprocessing steps independent of the classifier, and wrappers, which use the classifier as a black box to perform the selection. However, even filter selection is related to the classifier, as the selection criterion is always based on an assumption about the classification process.
Linear discriminants aim at determining an optimal hyperplane separating the examples of two classes, but the choice of the optimality criterion implies underlying assumptions. Support vector machines (SVMs) rely on the distance between the separating hyperplane and the closest examples, the so-called margin. The problem, widely explored [4], [5], is solved through quadratic programming. The "kernel trick" further introduces nonlinearity by substituting a kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩ for inner products (where Φ can be implicit), under some restrictions on the choice of k. The target space of Φ is generally called the feature space, and has a much higher dimension (possibly infinite) than the original input space. This transformation widens the range and complexity of possible decision surfaces in the input space.
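The kernel trick can be made concrete with a small example: for the homogeneous polynomial kernel of degree 2 on R², the feature map Φ is explicit, and one can verify k(x, y) = ⟨Φ(x), Φ(y)⟩ numerically (a minimal sketch; the kernel choice and code are ours, for illustration only):

```python
import numpy as np

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = <x, y>^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map for this kernel on R^2:
    Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), so k(x, y) = <Phi(x), Phi(y)>."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

For most kernels of practical interest (e.g., the Gaussian kernel), Φ remains implicit; only the kernel values are ever computed.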
Several methods address the problem of taking the SVM
underlying the process into account in the feature selection
step, among which the radius-margin bound [6] shows very
good results in practice. Nevertheless, these approaches often involve multiple SVM trainings, and even other optimization processes, as part of the selection, and are thus computationally expensive. Moreover, some are not designed to scale up to very large datasets. We propose here three new feature selection methods based on the kernel target alignment (KTA) and kernel class separability (KCS) criteria, which are evaluated iteratively from the Gram matrix values alone and are thus simple and very scalable in terms of memory.
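As a first illustration of a Gram-matrix-based criterion, the two-class kernel target alignment of a Gram matrix K with labels y ∈ {−1, +1}ⁿ is ⟨K, yy^T⟩_F / (‖K‖_F ‖yy^T‖_F), and can be computed from K alone (a minimal NumPy sketch; the RBF kernel and toy data are our own assumptions, and this two-class form is simpler than the multiclass criteria developed in the paper):

```python
import numpy as np

def kta(K, y):
    """Kernel target alignment <K, yy^T>_F / (||K||_F ||yy^T||_F),
    with y in {-1, +1}^n."""
    Y = np.outer(y, y)
    return np.sum(K * Y) / (np.linalg.norm(K, "fro") * np.linalg.norm(Y, "fro"))

def rbf_gram(X, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-gamma ||x - y||^2)."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

# Toy data: two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_true = np.array([1, 1, -1, -1])       # labels matching the clusters
y_rand = np.array([1, -1, 1, -1])       # labels ignoring the clusters
K = rbf_gram(X)
assert kta(K, y_true) > kta(K, y_rand)  # alignment rewards consistent labels
```

Since only the Gram matrix values are touched, such a criterion can be accumulated iteratively over examples, which is the property the proposed methods exploit.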
2162–237X/$31.00 © 2012 IEEE