SEPARABLE PCA FOR IMAGE CLASSIFICATION

Yongxin Taylor Xi and Peter J. Ramadge
Dept. Electrical Engineering, Princeton University, Princeton NJ

ABSTRACT

As an alternative to standard PCA, matrix-based image dimensionality reduction methods have recently been proposed and have gained attention due to reported computational efficiency and robust performance in classification. We unify all of these methods through one concept: Separable Principal Component Analysis (SPCA). We show that the proposed matrix methods are either equivalent to, special cases of, or approximations to SPCA. We include performance comparisons of the methods on two face data sets and a handwritten digit data set. The empirical results indicate that two existing methods, BD-PCA and its variant NGLRAM, are very good, efficiently computable, approximate solutions to practical SPCA problems.

Index Terms— Image classification, eigenvalues and eigenfunctions, discrete transforms, image representations, face recognition.

1. INTRODUCTION

Principal component analysis (PCA) is an important feature selection method used in many image detection/classification schemes. One prominent example is its successful application in face detection and classification, e.g. [1, 2]. However, estimation of the PCA projection from data has some limitations. First, its computational complexity makes it difficult to deal directly with high-dimensional data, e.g. images. Second, the number of examples available for estimating the PCA projection is typically much smaller than the ambient dimension of the data, and this can lead to overfitting of the projection. In an effort to alleviate these problems in image classification applications, several variations on standard PCA have recently been proposed [3, 4, 5, 6, 7]. These schemes are reported to have reduced computational burden and, when coupled with appropriate classifiers, to yield improved and robust classification rates [3, 4, 8, 5].
We seek to better understand the relationship of these algorithms with standard methods. Our main contribution is the unification of these methods through Separable PCA (SPCA). SPCA seeks a separable basis of images that maximizes the variance of the coordinates over the ensemble of data images. We show that each of the above schemes is either equivalent to, a special case of, or an approximation to SPCA. Specifically, 2DPCA [3] is an easily solvable special case of SPCA. BD-PCA [4] and NGLRAM [7] project the image data onto a separable basis. We give precise conditions under which BD-PCA is a solution of SPCA and, when these conditions are not satisfied, show that BD-PCA and NGLRAM give very good approximate solutions to SPCA. Finally, GLRAM [5], a method for obtaining low rank approximations, is equivalent to SPCA. Thus SPCA unifies a variety of prior proposals in the literature.

2. BACKGROUND

Let $\mathcal{X}$ denote a linear space and $\mathcal{Y}$ denote a finite set of labels. Given a set $\{(x_k, y_k) \in \mathcal{X} \times \mathcal{Y},\ k = 1, \dots, N\}$ of training examples ($x_k$ are instances, $y_k$ are labels), we want to design a classifier $h : \mathcal{X} \to \mathcal{Y}$ that 'best' predicts the label of a new test instance $x \in \mathcal{X}$. For example, each training instance might be an $m \times n$ grey scale face image with the associated label being the identifier of the corresponding individual. The PCA approach to this problem uses the training data $\{x_k\}_{k=1}^{N}$ to determine a linear projection $Q : \mathcal{X} \to \mathbb{R}^d$ into a lower dimensional space. Then the label information is used to design a classifier $h : \mathbb{R}^d \to \mathcal{Y}$. For example, this might be a nearest neighbor classifier in the projected space.

It will be helpful to review PCA when $\mathcal{X} = \mathbb{R}^s$, for some integer $s > 0$. Without loss of generality, assume that the data is centered, i.e., $\sum_{k=1}^{N} x_k = 0$. To select the PCA projection, form the data matrix $D = [x_1, x_2, \dots, x_N]$. The scatter matrix (empirical covariance) is then $DD^T = \sum_{k=1}^{N} x_k x_k^T \in \mathbb{R}^{s \times s}$. $DD^T$ has at most $N - 1$ nonzero eigenvalues.
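The rank bound on the scatter matrix is easy to verify numerically. Below is a minimal NumPy sketch (dimensions and variable names are ours, for illustration only) that centers a small synthetic data set, forms $DD^T$, and confirms it has at most $N - 1$ nonzero eigenvalues, since centering makes the columns of $D$ sum to zero:

```python
import numpy as np

rng = np.random.default_rng(0)
s, N = 50, 8                            # ambient dimension s, N examples (s >> N)
X = rng.standard_normal((s, N))         # columns are raw instances x_k
D = X - X.mean(axis=1, keepdims=True)   # center: now sum_k x_k = 0

S = D @ D.T                             # scatter matrix DD^T, shape (s, s)
rank = np.linalg.matrix_rank(S)
print(rank)                             # at most N - 1 = 7
```

The rank drops by one because centering removes one degree of freedom: the last column of $D$ is minus the sum of the others.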
Let $w_j$, $j = 1, \dots, d$, denote the first $d$ eigenvectors ordered by eigenvalue, largest to smallest. The PCA projection into $\mathbb{R}^d$ results by setting $P = [w_1, w_2, \dots, w_d]$ and $\hat{x}_k = P^T x_k$. In practice, one computes $P$ from an SVD $D = U \Sigma V^T$, yielding $DD^T = U \Sigma^2 U^T$ and $P = [u_1, \dots, u_d]$, where the $u_j$ are the first $d$ left singular vectors of $D$. For $N \ll s$, the complexity of computing $P$ is $O(sN^2)$ in time and $O(sN)$ in space.

When each data point is an $m \times n$ grey scale image $A_k$, PCA finds an ON set $\{W_j\}_{j=1}^{d}$ of $d$ principal eigenimages of the empirical covariance function of the image data [9]. Image $A_k$ is then projected to its coordinates with respect to this ON basis, i.e., $\hat{a}_{kj} = \langle A_k, W_j \rangle$, $j = 1, \dots, d$, where $\langle \cdot, \cdot \rangle$ is the standard inner product. It is convenient to compute these eigenimages by exploiting an isometry between $\mathbb{R}^{m \times n}$ and

1805 978-1-4244-2354-5/09/$25.00 ©2009 IEEE ICASSP 2009
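The SVD route to the PCA projection reviewed above can be sketched in a few lines of NumPy (a hedged illustration; the sizes and variable names are ours, not the paper's). It computes $P$ from the thin SVD of the centered data matrix and checks that the columns of $P$ are indeed the leading eigenvectors of the scatter matrix, consistent with $DD^T = U \Sigma^2 U^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
s, N, d = 100, 10, 3                 # ambient dim, sample count (N << s), target dim
D = rng.standard_normal((s, N))
D -= D.mean(axis=1, keepdims=True)   # center the data

# Thin SVD: for N << s this costs O(s N^2) time and O(s N) space,
# avoiding the s x s scatter matrix entirely.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)
P = U[:, :d]                         # first d left singular vectors

# Sanity check: S P = P diag(sigma[:d]^2), i.e. the columns of P are
# eigenvectors of DD^T with the d largest eigenvalues sigma_j^2.
S = D @ D.T
assert np.allclose(S @ P, P * sigma[:d] ** 2)

X_hat = P.T @ D                      # projected coordinates, shape (d, N)
```

Forming $DD^T$ above is done only for the check; in practice one keeps the thin SVD, which is exactly why the $O(sN^2)$ cost quoted in the text is attainable.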