Pattern Recognition Letters 2 (1983) 61-68
North-Holland
December 1983

A unified approach to optimal feature selection

Salvatore D. MORGERA
Concordia University, Dept. of Electrical Engineering, 1455 de Maisonneuve Blvd. West, Montréal, H3G 1M8 Quebec, Canada

Received 11 March 1983
Revised 23 April 1983

Note: Research supported by Canadian Natural Sciences and Engineering Research Council (NSERC) Grant A0912.

Abstract: The optimum finite set of linear observables for discriminating two Gaussian stochastic processes is derived using classical methods and distribution function theory. The results offer a new, accurate information-theoretic strategy and are superior to well-known conventional methods using statistical distance measures.

Key words: Feature extraction, data compression, Bayesian error, distribution function, Toeplitz matrix, entropy, statistical distance measures.

1. Introduction

Let x be a real (N × 1)-dimensional data vector with a multivariate normal (MVN) distribution, N(0, Σ_i), under hypothesis H_i, i = 1, 2. Assume that H_i has a priori probability π_i, i = 1, 2; π₁ + π₂ = 1, π_i ≠ 0, 1. We wish to apply an (n × N)-dimensional (n ≤ N) data-reducing transformation 𝒜 to the data to obtain the feature vector y, i.e.

    y = 𝒜x.    (1)

Assume the rows of 𝒜 are linearly independent (l.i.). The feature vector y then has an MVN distribution, N(0, Σ̃_i), where Σ̃_i = 𝒜Σ_i𝒜ᵀ, under hypothesis H_i, i = 1, 2, respectively. Let 𝒜* be the matrix which simultaneously diagonalizes the pair (Σ̃₁, Σ̃₂) into the diagonal pair (I_n, Λ*). In the usual manner, define the region ℛ₁ ⊂ ℝⁿ as the critical region for rejecting H₁, given by

    ℛ₁(n; λ*) = { y* : y*ᵀ[I_n − Λ*⁻¹]y* > t(n; λ*) },    (2)

where t(n; λ*) = ∑_{i=1}^{n} ln λ_i* + 2 ln(π₁/π₂), and λ₁* ≥ λ₂* ≥ ··· ≥ λ_n* > 0 are the diagonal elements of Λ*, which we assemble as the canonical variate λ*. The probability of classification error based on n features is then

    P_e(n; λ*) = π₁ I₁(n; λ*) + π₂ I₂(n; λ*),    (3a)

where

    I₁(n; λ*) = Prob{ ∑_{i=1}^{n} (1 − 1/λ_i*) z_i² > t(n; λ*) },    (3b)
    I₂(n; λ*) = Prob{ ∑_{i=1}^{n} (λ_i* − 1) z_i² ≤ t(n; λ*) },    (3c)

with the z_i being statistically independent N(0, 1) variates. Due to the general feeling that (3) cannot be directly useful in selecting a transformation 𝒜 which minimizes P_e(n; λ*), many workers in pattern recognition, control systems, communications, and information theory have bounded P_e(n; λ*) in terms of statistical distance measures [Kazakos et al. (1980), Kadota et al. (1967), Tou et al. (1967), and Kanal (1974)]. In particular, the feature selec-

0167-8655/83/$3.00 © 1983, Elsevier Science Publishers B.V. (North-Holland)
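As a concrete numerical illustration (not part of the paper), the construction of equations (1)-(3) can be sketched in Python: whiten Σ̃₁ by a Cholesky factor, eigendecompose the whitened Σ̃₂ to obtain 𝒜* and the canonical variates λ_i*, then estimate P_e(n; λ*) by Monte Carlo over the weighted chi-square sums in (3b)-(3c). The covariance pair, sample count, and function names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simultaneous_diagonalizer(S1, S2):
    """Return (A, lam) with A S1 A^T = I_n and A S2 A^T = diag(lam),
    lam sorted in decreasing order (the canonical variates lambda_i*)."""
    L = np.linalg.cholesky(S1)          # S1 = L L^T
    Linv = np.linalg.inv(L)
    M = Linv @ S2 @ Linv.T              # symmetric; eigenvalues of S1^{-1} S2
    lam, U = np.linalg.eigh(M)          # ascending eigenvalues
    order = np.argsort(lam)[::-1]       # lambda_1* >= ... >= lambda_n* > 0
    return U[:, order].T @ Linv, lam[order]

def bayes_error_mc(lam, pi1=0.5, n_mc=200_000):
    """Monte Carlo estimate of P_e(n; lambda*) from (3a)-(3c)."""
    pi2 = 1.0 - pi1
    t = np.sum(np.log(lam)) + 2.0 * np.log(pi1 / pi2)  # threshold of region R_1
    z2 = rng.standard_normal((n_mc, lam.size)) ** 2    # i.i.d. N(0,1), squared
    I1 = np.mean(z2 @ (1.0 - 1.0 / lam) > t)           # reject H1 when H1 true
    I2 = np.mean(z2 @ (lam - 1.0) <= t)                # accept H1 when H2 true
    return pi1 * I1 + pi2 * I2

# Illustrative 3x3 covariance pair (assumed for the example).
S1 = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 1.5]])
S2 = np.array([[1.0, 0.1, 0.0], [0.1, 3.0, 0.4], [0.0, 0.4, 0.8]])
A, lam = simultaneous_diagonalizer(S1, S2)
print(np.round(A @ S1 @ A.T, 6))  # ~ identity
print(np.round(A @ S2 @ A.T, 6))  # ~ diag(lam)
print(bayes_error_mc(lam))
```

The whitening-plus-eigendecomposition route is one standard way to realize the simultaneous diagonalization assumed in the text; it requires Σ̃₁ to be positive definite, which holds here because the rows of 𝒜 are linearly independent.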