Pattern Recognition Letters 2 (1983) 61-68 December 1983
North-Holland
A unified approach to optimal feature selection
Salvatore D. MORGERA
Concordia University, Dept. of Electrical Engineering, 1455 de Maisonneuve Blvd. West, Montréal, Quebec H3G 1M8, Canada
Received 11 March 1983
Revised 23 April 1983
Abstract: The optimum finite set of linear observables for discriminating two Gaussian stochastic processes is derived using
classical methods and distribution function theory. The results offer a new, accurate information-theoretic strategy and are
superior to well-known conventional methods using statistical distance measures.
Key words: Feature extraction, data compression, Bayesian error, distribution function, Toeplitz matrix, entropy, statistical
distance measures.
1. Introduction
Let x be a real (N×1)-dimensional data vector with a multivariate normal (MVN) distribution, N(0, Σ_i), under hypothesis H_i, i = 1, 2. Assume that H_i has a priori probability π_i, i = 1, 2; π_1 + π_2 = 1, π_i ≠ 0, 1. We wish to apply an (n×N)-dimensional data-reducing (n ≤ N) transformation A to the data to obtain the feature vector y, i.e.

    y = Ax.   (1)
Assume the rows of A are linearly independent (l.i.). The feature vector y then has an MVN distribution, N(0, Σ̃_i), where Σ̃_i = A Σ_i Aᵀ, under hypothesis H_i, i = 1, 2, respectively. Let A* be the matrix which simultaneously diagonalizes the pair (Σ̃_1, Σ̃_2) into the diagonal pair (I_n, Λ*). In the usual manner, define the region R* ⊂ ℝⁿ as the critical region for rejecting H_1, given by

    R* = { y* | y*ᵀ[I_n − Λ*⁻¹]y* ≥ 2 ln(π_1/π_2) + Σ_{i=1}^n ln λ_i* },   (2)
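As an illustration (not part of the original paper), the simultaneous diagonalizer of a pair of covariance matrices can be obtained by whitening the first covariance with its Cholesky factor and then eigendecomposing the whitened second covariance. The function name and this NumPy realization are our own sketch:

```python
import numpy as np

def simultaneous_diagonalizer(S1, S2):
    """Return (A_star, lam) such that A_star @ S1 @ A_star.T = I_n
    and A_star @ S2 @ A_star.T = diag(lam), with lam sorted in
    decreasing order (lambda_1 >= ... >= lambda_n > 0).

    S1, S2 : symmetric positive-definite (n x n) covariance matrices.
    """
    L = np.linalg.cholesky(S1)          # S1 = L @ L.T
    Linv = np.linalg.inv(L)
    M = Linv @ S2 @ Linv.T              # whitened S2, still symmetric
    lam, U = np.linalg.eigh(M)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]       # reorder to match the paper's convention
    lam, U = lam[order], U[:, order]
    A_star = U.T @ Linv                 # A_star whitens S1 and diagonalizes S2
    return A_star, lam
```

Since A_star S1 A_starᵀ = Uᵀ L⁻¹ (L Lᵀ) L⁻ᵀ U = Uᵀ U = I and A_star S2 A_starᵀ = Uᵀ M U = diag(lam), this reproduces the pair (I_n, Λ*) above.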
Note: Research supported by Canadian Natural Sciences and
Engineering Research Council (NSERC) Grant A0912.
where λ_1* ≥ λ_2* ≥ ⋯ ≥ λ_n* > 0 are the diagonal elements of Λ*, which we assemble as the canonical variate λ*. The probability of classification error based on n features is then

    P_e(n; λ*) = π_1 I_1(n; λ*) + π_2 I_2(n; λ*),   (3a)

where

    I_1(n; λ*) = Prob{ Σ_{i=1}^n (1 − 1/λ_i*) z_i² ≥ 2 ln(π_1/π_2) + Σ_{i=1}^n ln λ_i* },   (3b)

    I_2(n; λ*) = Prob{ Σ_{i=1}^n (λ_i* − 1) z_i² < 2 ln(π_1/π_2) + Σ_{i=1}^n ln λ_i* },   (3c)

with the z_i being statistically independent N(0, 1) variates.
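Although no closed form is given here, the error probabilities in (3) can be estimated directly by Monte Carlo sampling of the independent N(0, 1) variates z_i. The following sketch is our own illustration (the function name is hypothetical, and the critical-region threshold is taken as 2 ln(π_1/π_2) + Σ ln λ_i*):

```python
import numpy as np

def pe_monte_carlo(lam, pi1, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P_e(n; lambda*) = pi1*I1 + pi2*I2.

    lam  : canonical variates lambda_1*, ..., lambda_n* (all > 0)
    pi1  : a priori probability of H1 (0 < pi1 < 1)
    """
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    pi2 = 1.0 - pi1
    beta = 2.0 * np.log(pi1 / pi2) + np.sum(np.log(lam))  # decision threshold
    z2 = rng.standard_normal((n_samples, lam.size)) ** 2  # squared N(0,1) variates
    I1 = np.mean(z2 @ (1.0 - 1.0 / lam) >= beta)  # reject H1 although H1 is true
    I2 = np.mean(z2 @ (lam - 1.0) < beta)         # accept H1 although H2 is true
    return pi1 * I1 + pi2 * I2
```

As a sanity check, when every λ_i* = 1 the two hypotheses are indistinguishable and the estimate collapses to P_e = π_1, i.e. 0.5 for equal priors; widely separated eigenvalues drive P_e toward zero.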
Due to the general feeling that (3) cannot be directly useful in selecting a transformation A which minimizes P_e(n; λ*), many workers in pattern recognition, control systems, communications, and information theory have bounded P_e(n; λ*) in terms of statistical distance measures [Kazakos et al. (1980), Kadota et al. (1967), Tou et al. (1967), and Kanal (1974)]. In particular, the feature selec-
0167-8655/83/$3.00 © 1983, Elsevier Science Publishers B.V. (North-Holland)