702 IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 10, NO. 4, JULY 2013
Linear Feature Extraction for Hyperspectral Images
Based on Information Theoretic Learning
Mehdi Kamandar and Hassan Ghassemian, Senior Member, IEEE
Abstract—This letter proposes a new supervised linear feature
extractor for hyperspectral image classification. The criterion for
feature extraction is a modified maximal relevance and minimal
redundancy (MRMD) criterion, which until now has been used
only for feature selection. The MRMD is a function of mutual information terms,
which capture higher order statistics of the data; thus, it is effective
for hyperspectral data with informative higher order statistics.
Batch and stochastic versions of gradient ascent are performed
on the MRMD to find the optimal parameters of a linear
feature extractor. Preliminary results show better classification
performance than traditional methods based on the first- and
second-order moments of the data.
Index Terms—Hughes phenomenon, hyperspectral image classification, linear feature extractor, maximal relevance, minimal redundancy.
I. INTRODUCTION

Hyperspectral imagery consists of measurements in a large
number of closely spaced narrow bands that cover the visible
and near-infrared portions of the electromagnetic spectrum.
As the number of bands increases, greater separation ability is
potentially available, but the achievable classification quality is
limited by the Hughes phenomenon, a manifestation of the
curse of dimensionality [1]. When the number of bands grows
and the size of the training set is limited, the classification
accuracy reaches a maximum for a given training-set size and
then decreases. Labeling training samples in remote sensing
applications is very time consuming and costly. There are four
strategies to mitigate the Hughes phenomenon in supervised
classification of hyperspectral images, namely, fusion of spectral
and spatial information [2], [3]; semisupervised classification
[4]–[6]; using regularization terms when estimating the parameters
of a classifier, as in support vector machines [7]; and feature
reduction [1].
In this letter, we focus on linear feature extraction methods
to mitigate the Hughes phenomenon. Techniques used
for this purpose include maximum noise fraction (MNF) [8],
independent component analysis (ICA) [9], or unmixing [10]
as unsupervised ones; and projection pursuit (PP), with the
Bhattacharyya distance (BD) as projection index [11], or the
modified linear discriminant analysis (LDA) [12] as supervised
ones. An optimal criterion for feature extraction would
Manuscript received May 27, 2012; revised August 15, 2012; accepted
September 10, 2012. This work was supported in part by the Iran
Telecommunication Research Centre under Grant 8991/500.
The authors are with the Faculty of Electrical and Computer Engineering,
Tarbiat Modares University, Tehran 14155-4843, Iran (e-mail: m.kamandar@
modares.ac.ir; ghassemi@modares.ac.ir).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/LGRS.2012.2219575
Fig. 1. Typical steps for hyperspectral image classification.
naturally reflect the Bayes error rate in the transformed space, but
it is not particularly convenient to use the Bayes error rate
itself as a criterion due to computational difficulties. Instead of
it, a separability measure such as the BD or the Fisher ratio
(FR) can be used with less computational complexity (they are
functions of the first- and second-order moments of the data). In
this letter, a modified maximal relevance and minimal redundancy
(MRMD) criterion is used for feature extraction.
The MRMD is a function of mutual information (MI) terms,
which capture the higher order statistics of the data, whereas the BD
and the FR depend only on the first- and second-order ones. Thus, the
MRMD is more effective for hyperspectral data with informative
higher order statistics. We calculate the optimal parameters of
a linear feature extractor by maximizing the modified MRMD
with respect to them. To reduce the computational cost of the
MRMD maximization, a batch or stochastic version of
gradient ascent is used.
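The maximization described above can be sketched numerically. The snippet below is an illustrative MRMD-style objective for a linear projection W, using a simple histogram-based MI estimator and a finite-difference gradient ascent; the letter instead derives an analytic gradient from an information-theoretic-learning estimator, so the function names, bin count, and redundancy weighting here are assumptions for illustration only.

```python
import numpy as np

def mutual_info(a, b, bins=8):
    # Histogram-based MI estimate (in nats) between two 1-D variables.
    pab, _, _ = np.histogram2d(a, b, bins=bins)
    pab /= pab.sum()
    pa = pab.sum(axis=1)
    pb = pab.sum(axis=0)
    nz = pab > 0  # avoid log(0) on empty histogram cells
    return float((pab[nz] * np.log(pab[nz] / (pa[:, None] * pb[None, :])[nz])).sum())

def mrmd_score(W, X, labels, beta=1.0):
    # Relevance: MI between each extracted feature y_j = (XW)_j and the labels.
    # Redundancy: pairwise MI between extracted features.
    Y = X @ W
    m = Y.shape[1]
    rel = sum(mutual_info(Y[:, j], labels) for j in range(m))
    red = sum(mutual_info(Y[:, i], Y[:, j])
              for i in range(m) for j in range(i + 1, m))
    return rel - beta * (2.0 / max(m, 1)) * red  # hypothetical weighting

def fit(X, labels, m, steps=50, lr=0.5, eps=1e-2, seed=0):
    # Finite-difference gradient ascent on the MRMD score (for illustration;
    # the letter uses an analytic gradient in batch or stochastic form).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], m))
    for _ in range(steps):
        G = np.zeros_like(W)
        base = mrmd_score(W, X, labels)
        for idx in np.ndindex(*W.shape):
            Wp = W.copy()
            Wp[idx] += eps
            G[idx] = (mrmd_score(Wp, X, labels) - base) / eps
        W += lr * G
        W /= np.linalg.norm(W, axis=0, keepdims=True)  # keep columns bounded
    return W
```

Normalizing the columns of W after each step is one simple way to keep the projection bounded, since MI is invariant to scaling of the extracted features only in the continuous limit, not under a fixed histogram binning.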
In the next section, a modified MRMD is introduced. In
Section III, the proposed linear feature extractor is described.
Section IV presents experiments that compare the proposed
linear feature extractor with others in the literature.
Finally, Section V draws conclusions.
II. MRMD CRITERION
Fig. 1 illustrates a typical scenario for hyperspectral image
classification, where $\mathbf{x} = [x_1\ x_2\ \ldots\ x_n]$ represents the $n$
hyperspectral bands and $\mathbf{y} = [y_1\ y_2\ \ldots\ y_m]$ is the $m$ extracted
features ($m < n$). The optimal features are calculated by minimizing
their Bayes error rate, which is given by

$$P_e^B(\mathbf{y}) = \int_{\mathbf{y}} f(\mathbf{y})\Bigl(1 - \max_{p} P(c_p \mid \mathbf{y})\Bigr)\, d\mathbf{y} \qquad (1)$$
where $c_p$ denotes the label of the $p$th class, $f(\mathbf{y})$ is the
probability density function (pdf) of $\mathbf{y}$, and $P(c_p \mid \mathbf{y})$ is the
a posteriori probability of the $p$th class. Now, the optimal feature
extractor can be determined by

$$g^{*}(\cdot) = \arg\min_{g(\cdot)} P_e^B\bigl(g(\mathbf{x})\bigr). \qquad (2)$$
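When the class-conditional densities are known, the Bayes error rate in (1) can be evaluated directly by numerical integration. The sketch below does this for a hypothetical 1-D two-class problem with equal priors and unit-variance Gaussian class conditionals; the grid limits and function names are assumptions for illustration, not part of the letter.

```python
import numpy as np

def gauss_pdf(y, mu, sigma):
    # Univariate Gaussian density.
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_error_1d(mu0, mu1, sigma, n_grid=20001, lo=-10.0, hi=10.0):
    # Plug-in evaluation of (1): integrate f(y) * (1 - max_p P(c_p | y)).
    y = np.linspace(lo, hi, n_grid)
    p0 = 0.5 * gauss_pdf(y, mu0, sigma)   # prior * class-conditional, class 0
    p1 = 0.5 * gauss_pdf(y, mu1, sigma)   # prior * class-conditional, class 1
    f = p0 + p1                           # mixture density f(y)
    post_max = np.maximum(p0, p1) / f     # max_p P(c_p | y) via Bayes' rule
    g = f * (1.0 - post_max)
    # Trapezoidal rule over the grid.
    return float(np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(y)))
```

For two unit-variance Gaussians at $\pm 2$ with equal priors, this reproduces the closed-form value $\Phi(-2) \approx 0.0228$, while coincident means give the chance level of $0.5$, matching the intuition that (1) measures the irreducible overlap of the classes.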
The Bayes error rate suffers from its high computational
complexity. Its calculation requires estimating the a posteriori
probabilities of the classes and the multivariate pdf of the features, as well as
1545-598X/$31.00 © 2012 IEEE