SEPARABLE PCA FOR IMAGE CLASSIFICATION

Yongxin Taylor Xi and Peter J. Ramadge
Dept. Electrical Engineering, Princeton University, Princeton NJ

ABSTRACT

As an alternative to standard PCA, matrix-based image dimensionality reduction methods have recently been proposed and have gained attention due to reported computational efficiency and robust performance in classification. We unify all of these methods through one concept: Separable Principal Component Analysis (SPCA). We show that the proposed matrix methods are either equivalent to, special cases of, or approximations to SPCA. We include performance comparisons of the methods on two face data sets and a handwritten digit data set. The empirical results indicate that two existing methods, BD-PCA and its variant NGLRAM, are very good, efficiently computable, approximate solutions to practical SPCA problems.

Index Terms— Image classification, eigenvalues and eigenfunctions, discrete transforms, image representations, face recognition.

1. INTRODUCTION

Principal component analysis (PCA) is an important feature selection method used in many image detection/classification schemes. One prominent example is its successful application in face detection and classification, e.g. [1, 2]. However, estimation of the PCA projection from data has some limitations. First, its computational complexity makes it difficult to deal directly with high-dimensional data, e.g. images. Second, the number of examples available for estimating the PCA projection is typically much smaller than the ambient dimension of the data, and this can lead to overfitting of the projection. In an effort to alleviate these problems in image classification applications, several variations on standard PCA have recently been proposed [3, 4, 5, 6, 7]. These schemes are reported to have reduced computational burden and, when coupled with appropriate classifiers, to yield improved and robust classification rates [3, 4, 8, 5].
We seek to better understand the relationship of these algorithms with standard methods. Our main contribution is the unification of these methods through Separable PCA (SPCA). SPCA seeks a separable basis of images that maximizes the variance of the coordinates over the ensemble of data images. We show that each of the above schemes is either equivalent to, a special case of, or an approximation to SPCA. Specifically, 2DPCA [3] is an easily solvable special case of SPCA. BD-PCA [4] and NGLRAM [7] project the image data onto a separable basis. We give precise conditions under which BD-PCA is a solution of SPCA and, when these conditions are not satisfied, show that BD-PCA and NGLRAM give very good approximate solutions to SPCA. Finally, GLRAM [5], a method for obtaining low rank approximations, is equivalent to SPCA. Thus SPCA unifies a variety of prior proposals in the literature.

2. BACKGROUND

Let $\mathcal{X}$ denote a linear space and $\mathcal{Y}$ denote a finite set of labels. Given a set $\{(x_k, y_k) \in \mathcal{X} \times \mathcal{Y},\ k = 1, \dots, N\}$ of training examples ($x_k$ are instances, $y_k$ are labels), we want to design a classifier $h : \mathcal{X} \to \mathcal{Y}$ that 'best' predicts the label of a new test instance $x \in \mathcal{X}$. For example, each training instance might be an $m \times n$ grey scale face image with the associated label being the identifier of the corresponding individual. The PCA approach to this problem uses the training data $\{x_k\}_{k=1}^{N}$ to determine a linear projection $Q : \mathcal{X} \to \mathbb{R}^d$ into a lower dimensional space. Then the label information is used to design a classifier $h : \mathbb{R}^d \to \mathcal{Y}$. For example, this might be a nearest neighbor classifier in the projected space.

It will be helpful to review PCA when $\mathcal{X} = \mathbb{R}^s$, for some integer $s > 0$. Without loss of generality, assume that the data is centered, i.e., $\sum_{k=1}^{N} x_k = 0$. To select the PCA projection, form the data matrix $D = [x_1, x_2, \dots, x_N]$. The scatter matrix (empirical covariance) is then $DD^T = \sum_{k=1}^{N} x_k x_k^T \in \mathbb{R}^{s \times s}$. $DD^T$ has at most $N - 1$ nonzero eigenvalues.
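The rank bound on the scatter matrix is easy to verify numerically. Below is a minimal NumPy sketch (dimensions and variable names are ours, for illustration only) that centers a small synthetic data set, forms $DD^T$, and confirms it has at most $N - 1$ nonzero eigenvalues, since centering makes the columns of $D$ sum to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
s, N = 50, 8                            # ambient dimension s, N examples (s >> N)
X = rng.standard_normal((s, N))         # columns are raw instances x_k
D = X - X.mean(axis=1, keepdims=True)   # center: now sum_k x_k = 0

S = D @ D.T                             # scatter matrix DD^T, shape (s, s)
rank = np.linalg.matrix_rank(S)
print(rank)                             # at most N - 1 = 7
```

The rank drops by one because centering removes one degree of freedom: the last column of $D$ is minus the sum of the others.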
Let $w_j$, $j = 1, \dots, d$, denote the first $d$ eigenvectors ordered by eigenvalue, largest to smallest. The PCA projection into $\mathbb{R}^d$ results by setting $P = [w_1, w_2, \dots, w_d]$ and $\hat{x}_k = P^T x_k$. In practice, one computes $P$ from an SVD $D = U \Sigma V^T$, yielding $DD^T = U \Sigma^2 U^T$ and $P = [u_1, \dots, u_d]$, where the $u_j$ are the first $d$ left singular vectors of $D$. For $N \ll s$, the complexity of computing $P$ is $O(sN^2)$ in time and $O(sN)$ in space.

When each data point is an $m \times n$ grey scale image $A_k$, PCA finds an ON set $\{W_j\}_{j=1}^{d}$ of $d$ principal eigenimages of the empirical covariance function of the image data [9]. Image $A_k$ is then projected to its coordinates with respect to this ON basis, i.e., $\hat{a}_{kj} = \langle A_k, W_j \rangle$, $j = 1, \dots, d$, where $\langle \cdot, \cdot \rangle$ is the standard inner product. It is convenient to compute these eigenimages by exploiting an isometry between $\mathbb{R}^{m \times n}$ and

1805 978-1-4244-2354-5/09/$25.00 ©2009 IEEE ICASSP 2009
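The SVD route to the PCA projection reviewed above can be sketched in a few lines of NumPy (a hedged illustration; the sizes and variable names are ours, not the paper's). It computes $P$ from the thin SVD of the centered data matrix and checks that the columns of $P$ are indeed the leading eigenvectors of the scatter matrix, consistent with $DD^T = U \Sigma^2 U^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
s, N, d = 100, 10, 3                 # ambient dim, sample count (N << s), target dim
D = rng.standard_normal((s, N))
D -= D.mean(axis=1, keepdims=True)   # center the data

# Thin SVD: for N << s this costs O(s N^2) time and O(s N) space,
# avoiding the s x s scatter matrix entirely.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)
P = U[:, :d]                         # first d left singular vectors

# Sanity check: S P = P diag(sigma[:d]^2), i.e. the columns of P are
# eigenvectors of DD^T with the d largest eigenvalues sigma_j^2.
S = D @ D.T
assert np.allclose(S @ P, P * sigma[:d] ** 2)

X_hat = P.T @ D                      # projected coordinates, shape (d, N)
```

Forming $DD^T$ above is done only for the check; in practice one keeps the thin SVD, which is exactly why the $O(sN^2)$ cost quoted in the text is attainable.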