USING THE KERNEL TRICK IN COMPRESSIVE SENSING: ACCURATE SIGNAL RECOVERY FROM FEWER MEASUREMENTS

Hanchao Qi, Shannon Hughes
Department of Electrical, Computer, and Energy Engineering
University of Colorado at Boulder

ABSTRACT

Compressive sensing accurately reconstructs a signal that is sparse in some basis from measurements, generally consisting of the signal's inner products with Gaussian random vectors. The number of measurements needed is based on the sparsity of the signal, allowing for signal recovery from far fewer measurements than is required by the traditional Shannon sampling theorem. In this paper, we show how to apply the kernel trick, popular in machine learning, to adapt compressive sensing to a different type of sparsity. We consider a signal to be "nonlinearly K-sparse" if the signal can be recovered as a nonlinear function of K underlying parameters. Images that lie along a low-dimensional manifold are good examples of this type of nonlinear sparsity. It has been shown that natural images are as well [1]. We show how to accurately recover these nonlinearly K-sparse signals from approximately 2K measurements, which is often far lower than the number of measurements usually required under the assumption of sparsity in an orthonormal basis (e.g. wavelets). In experimental results, we find that we can recover images far better for small numbers of compressive sensing measurements, sometimes reducing the mean square error (MSE) of the recovered image by an order of magnitude or more, with little computation. A bound on the error of our recovered signal is also proved.

Index Terms— Compressive sensing, kernel methods.

1. INTRODUCTION

The "kernel trick" in machine learning is a way to easily adapt linear algorithms to nonlinear situations.
For example, by applying the kernel trick to the support vector machine (SVM) algorithm, which constructs the best linear hyperplane separating data points belonging to two different classes, we obtain the kernel SVM algorithm, an algorithm that constructs the best curved boundary separating data points belonging to two different classes. Similarly, principal components analysis (PCA) selects the best linear projection of the data to minimize error between the original and projected data. Kernel PCA finds the best smooth polynomial mapping to represent the data.

The key idea of the kernel trick is that, conceptually, we map our data from the original data space $\mathbb{R}^P$ to a much higher-dimensional feature space $\mathcal{F}$ using the nonlinear mapping $\Phi: \mathbb{R}^P \to \mathcal{F}$ before applying the usual linear algorithm such as SVM or PCA in the feature space. As an example, we might map a point $(x_1, x_2) \in \mathbb{R}^2$ onto the higher-dimensional vector with components $x_1$, $x_2$, $x_1^2$, $x_2^2$, $x_1 x_2$, $x_1^3$, etc. before applying SVM or PCA. A linear boundary in the higher-dimensional feature space ($\sum_j a_j \Phi(x)_j = C$) can then be expressed as a polynomial boundary in the original space ($a_0 x_1 + a_1 x_2 + a_2 x_1^2 + \ldots = C$). Similarly, a linear mapping of the data becomes a polynomial mapping.

However, this view of the kernel trick is purely conceptual. In reality, we avoid the complexity of mapping to and working in the high-dimensional feature space. When the original algorithm can be written in terms of only inner products between points, not the points themselves, we can replace the original inner product $\langle x, y \rangle$ with the new inner product $k(x, y) = \langle \Phi(x), \Phi(y) \rangle$ and run the original algorithm without additional computation. For example, a popular choice of $k(x, y)$ is $(\langle x, y \rangle + c)^d$, which produces a $\Phi$ of monomials as described above. As an illustration, for $x, y \in \mathbb{R}^2$, $c = 0$, $d = 2$, we have $k(x, y) = \langle x, y \rangle^2 = \langle (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2), (y_1^2, \sqrt{2}\, y_1 y_2, y_2^2) \rangle$, so $\Phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$.
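The identity above can be checked numerically. The following sketch (not from the paper; the function names `poly_kernel` and `phi` are illustrative) verifies that the degree-2 polynomial kernel with $c = 0$ equals the inner product of the explicit feature maps:

```python
import numpy as np

def poly_kernel(x, y, c=0.0, d=2):
    # Polynomial kernel: k(x, y) = (<x, y> + c)^d
    return (np.dot(x, y) + c) ** d

def phi(x):
    # Explicit feature map for c = 0, d = 2 on R^2:
    # Phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

# The kernel computes the feature-space inner product
# without ever forming Phi(x) or Phi(y) explicitly.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

Here $\langle x, y \rangle = 1$, so both sides evaluate to $1$; the same agreement holds for any pair of points, which is what lets linear algorithms run in feature space at no extra cost.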
Kernel PCA [2] often reveals low-dimensional representations of the dataset that reflect its underlying degrees of freedom. For example, in the synthetic "sculpture faces" dataset of Fig. 1, each face image is a highly nonlinear, but deterministic, function of three underlying variables: two pose angles and one lighting angle. Kernel PCA, performed with a well-chosen kernel function, is able to pick out two of these degrees of freedom as the first two dimensions chosen in kernel PCA. (Note that an ordinary PCA will not.) The results of kernel PCA can thus reflect a type of nonlinear sparsity in the dataset. We could represent each image fairly accurately knowing only its coordinates in this two-dimensional representation.

Indeed, we may be able to build a better approximation of the image knowing its first m coordinates in a nonlinearly sparse representation such as kernel PCA than we can knowing its largest m Fourier, wavelet, or curvelet coefficients. Fig. 1(c) shows a comparison of the mean-squared error for an individual image of the "sculpture faces" dataset when approximated from m kernel PCA components vs. m wavelet coefficients. The MSE decays much faster for kernel PCA components, showing that the image is more nonlinearly sparse than linearly sparse. Like this simple toy dataset, natural images have been shown to be nonlinearly sparse: patches of natural images tend to lie along low-dimensional manifolds [1].

In view of this nonlinear sparsity, consider compressive sensing [3, 4]. The recent theory of compressive sensing asserts that we can achieve perfect reconstruction of a signal with far fewer samples than Shannon-Nyquist sampling traditionally requires, if the signal is approximately sparse in some basis. In fact, in practice, we can achieve a near-perfect, or even perfect, reconstruction of the signal from about 5K measurements, each of which is the K-sparse signal's inner product with a random vector.
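The standard (linear) compressive sensing pipeline just described can be sketched in a few lines. This is an illustrative example, not the paper's method: the measurement matrix is Gaussian as described above, and orthogonal matching pursuit is used as one common recovery algorithm (the paper does not prescribe a particular solver); all dimensions and names are assumptions chosen for the toy setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# A K-sparse signal in R^P (sparse in the identity basis for simplicity),
# measured with M ~ 5K inner products against Gaussian random vectors.
P, K, M = 128, 4, 30
x = np.zeros(P)
support = rng.choice(P, size=K, replace=False)
x[support] = rng.standard_normal(K)

A = rng.standard_normal((M, P)) / np.sqrt(M)  # rows are the random vectors
y = A @ x                                     # the M compressive measurements

def omp(A, y, K):
    """Greedy sparse recovery via orthogonal matching pursuit (sketch)."""
    residual, idx = y.copy(), []
    for _ in range(K):
        # Pick the column most correlated with the current residual,
        # then re-fit the coefficients on the selected columns.
        idx.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
        residual = y - A[:, idx] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[idx] = coef
    return x_hat

x_hat = omp(A, y, K)
```

For a Gaussian measurement matrix with M on the order of 5K, a solver like this typically recovers the K-sparse signal exactly or nearly so; the point of the paper is that signals which are only *nonlinearly* sparse do not fit this linear model directly.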
In this paper, we will show how the kernel trick can be used to adapt this paradigm of reconstructing a linearly sparse signal from a linear set of measurements to the case of reconstructing a nonlinearly sparse signal from either nonlinear or linear measurements. The key idea is that a signal that is nonlinearly sparse can, with a proper choice of kernel, become linearly sparse in feature space, as our "sculpture faces" did above. We can thus reconstruct it from random measurements in feature space, which can be easily obtained from the usual random measurements for some kernels. Experimentally, we find that when the signal to be reconstructed is nonlinearly sparse, our method reconstructs it from far fewer compressive sensing measurements, sometimes using an order of magnitude fewer measurements to achieve the same MSE.

Section 2 outlines our recovery algorithm. Section 3 presents experimental results showing its power on sample datasets. Finally,