702 IEEE GEOSCIENCE AND REMOTE SENSING LETTERS, VOL. 10, NO. 4, JULY 2013

Linear Feature Extraction for Hyperspectral Images Based on Information Theoretic Learning

Mehdi Kamandar and Hassan Ghassemian, Senior Member, IEEE

Abstract—This letter proposes a new supervised linear feature extractor for hyperspectral image classification. The criterion for feature extraction is a modified maximal-relevance minimal-redundancy (MRMD) measure, which until now has been used only for feature selection. The MRMD is a function of mutual information terms, which capture the higher order statistics of the data; it is therefore well suited to hyperspectral data, whose higher order statistics are informative. Batch and stochastic versions of gradient ascent are performed on the MRMD to find the optimal parameters of a linear feature extractor. Preliminary results show better classification performance than traditional methods based on the first- and second-order moments of the data.

Index Terms—Hughes phenomenon, hyperspectral image classification, linear feature extractor, maximal relevance, minimal redundancy.

I. INTRODUCTION

HYPERSPECTRAL imagery consists of measurements in a large number of closely spaced narrow bands that cover the visible and near-infrared portions of the electromagnetic spectrum. With an increasing number of bands, more discriminative ability is potentially available, but the achievable classification quality is limited by the Hughes phenomenon, a manifestation of the curse of dimensionality [1]: when the number of bands grows while the size of the training set is limited, the classification accuracy reaches a maximum for a given training-set size and then decreases. Labeling training samples in remote sensing applications is very time consuming and costly.
There are four strategies to mitigate the Hughes phenomenon in supervised classification of hyperspectral images: fusion of spectral and spatial information [2], [3]; semisupervised classification [4]–[6]; regularization terms in the estimation of classifier parameters, as in the support vector machine [7]; and feature reduction [1].

In this letter, we focus on linear feature extraction methods to mitigate the Hughes phenomenon. Techniques used for this purpose include maximum noise fraction (MNF) [8], independent component analysis (ICA) [9], and unmixing [10] among the unsupervised methods, and projection pursuit (PP) with the Bhattacharyya distance (BD) as projection index [11] and modified linear discriminant analysis (LDA) [12] among the supervised ones. An optimal criterion for feature extraction would naturally reflect the Bayes error rate in the transformed space, but the Bayes error rate itself is inconvenient to use as a criterion because of computational difficulties. Instead, a separability measure such as the BD or the Fisher ratio (FR) can be used with lower computational complexity (both are functions of the first- and second-order moments of the data). In this letter, a modified maximal-relevance minimal-redundancy (MRMD) measure is used as the criterion for feature extraction.

Manuscript received May 27, 2012; revised August 15, 2012; accepted September 10, 2012. This work was supported in part by the Iran Telecommunication Research Centre under Grant 8991/500. The authors are with the Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran 14155-4843, Iran (e-mail: m.kamandar@modares.ac.ir; ghassemi@modares.ac.ir). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LGRS.2012.2219575

Fig. 1. Typical steps for hyperspectral image classification.
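The exact form of the modified MRMD is developed later in the letter; as a rough intuition, relevance/redundancy criteria of this family score a feature set by the average mutual information (MI) between each feature and the class label, penalized by the average pairwise MI among the features. The sketch below illustrates that idea with a simple histogram-based MI estimator; the function names, the bin count, and the plain "relevance minus redundancy" form are assumptions for illustration, not the authors' exact criterion.

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Histogram-based MI estimate (in nats) between two 1-D samples."""
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of a
    py = pxy.sum(axis=0, keepdims=True)   # marginal of b
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def mrmd_score(Y, labels, bins=8):
    """Mean MI with the class (relevance) minus mean pairwise MI (redundancy)."""
    m = Y.shape[1]
    relevance = np.mean([mutual_info(Y[:, j], labels, bins) for j in range(m)])
    pairs = [(j, k) for j in range(m) for k in range(j + 1, m)]
    redundancy = np.mean([mutual_info(Y[:, j], Y[:, k], bins)
                          for j, k in pairs]) if pairs else 0.0
    return relevance - redundancy

# Toy demo: one class-correlated feature plus one pure-noise feature.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)
f1 = labels + 0.3 * rng.standard_normal(500)   # informative
f2 = rng.standard_normal(500)                  # noise, nearly independent
Y = np.column_stack([f1, f2])
print(mrmd_score(Y, labels))
```

The informative feature drives the relevance term up while contributing little redundancy, so the score comes out positive; replacing f1 with another noise column would push it toward zero.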
The MRMD is a function of mutual information (MI) terms, which capture the higher order statistics of the data, whereas the BD and FR involve only the first- and second-order ones. The MRMD is therefore more effective for hyperspectral data, whose higher order statistics are informative. We calculate the optimal parameters of a linear feature extractor by maximizing the modified MRMD with respect to them. To reduce the computational cost of this maximization, batch or stochastic versions of gradient ascent are used.

In the next section, the modified MRMD is introduced. In Section III, the proposed linear feature extractor is described. Section IV presents experiments comparing the proposed linear feature extractor with others in the literature. Finally, Section V draws the conclusions.

II. MRMD CRITERION

Fig. 1 illustrates a typical scenario for hyperspectral image classification, where x = [x_1 x_2 ... x_n] represents the n hyperspectral bands and y = [y_1 y_2 ... y_m] represents the m extracted features (m < n). The optimal features are calculated by minimizing their Bayes error rate, which is given by

P_e^B(y) = \int_y f(y) \left( 1 - \max_p P(c_p \mid y) \right) dy    (1)

where c_p denotes the label of the pth class, f(y) is the probability density function (pdf) of y, and P(c_p \mid y) is the a posteriori probability of the pth class. The optimal feature extractor can then be determined as

g^*(\cdot) = \arg\min_{g(\cdot)} P_e^B(g(x)).    (2)

The Bayes error rate suffers from high computational complexity: its calculation requires estimating the a posteriori probabilities of the classes and the multivariate pdf of the features, as well as

1545-598X/$31.00 © 2012 IEEE
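Equation (1) can be checked numerically in a case where the Bayes error is known in closed form. The sketch below (a minimal illustration, not part of the letter's method) takes two equal-prior 1-D Gaussian classes and integrates f(y)(1 - max_p P(c_p | y)) on a fine grid; for this symmetric case the Bayes error also equals the Gaussian tail probability Phi(-|mu2 - mu1| / 2).

```python
import numpy as np
from math import erf

def norm_pdf(y, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated on the grid y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Two equal-prior 1-D Gaussian classes, N(0, 1) and N(2, 1).
mu1, mu2, prior = 0.0, 2.0, 0.5
y = np.linspace(-8.0, 10.0, 20001)
dy = y[1] - y[0]
f1, f2 = norm_pdf(y, mu1), norm_pdf(y, mu2)

f = prior * f1 + prior * f2                    # mixture pdf f(y)
post_max = np.maximum(prior * f1, prior * f2) / f   # max_p P(c_p | y)
bayes_err = np.sum(f * (1.0 - post_max)) * dy       # numerical version of (1)

# Closed form for this symmetric two-Gaussian case: Phi(-|mu2 - mu1| / 2).
closed = 0.5 * (1.0 + erf((-abs(mu2 - mu1) / 2.0) / np.sqrt(2.0)))
print(bayes_err, closed)   # both approximately 0.1587
```

Even in this trivial 1-D setting the computation needs the class posteriors and the mixture pdf on a dense grid; in the m-dimensional feature space of (2) those estimates become the computational bottleneck the letter refers to.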