Large-Scale Feature Learning With Spike-and-Slab Sparse Coding

Ian J. Goodfellow  goodfeli@iro.umontreal.ca
Aaron Courville  Aaron.Courville@umontreal.ca
Yoshua Bengio  Yoshua.Bengio@umontreal.ca
DIRO, Université de Montréal, Montréal, Québec, Canada

Abstract

We consider the problem of object recognition with a large number of classes. To overcome the scarcity of labeled examples available in this setting, we introduce a new feature learning and extraction procedure based on a factor model we call spike-and-slab sparse coding (S3C). Prior work on S3C has not prioritized the ability to exploit parallel architectures and to scale S3C to the enormous problem sizes needed for object recognition. We present a novel inference procedure, appropriate for use with GPUs, which allows us to dramatically increase both the training set size and the number of latent factors with which S3C may be trained. We demonstrate that this approach improves upon the supervised learning capabilities of both sparse coding and the spike-and-slab Restricted Boltzmann Machine (ssRBM) on the CIFAR-10 dataset. We use the CIFAR-100 dataset to demonstrate that our method scales to large numbers of classes better than previous methods. Finally, we use our method to win the Transfer Learning Challenge of the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models.

1. Introduction

We consider here the problem of unsupervised feature discovery for supervised learning. In supervised learning, one is given a set of examples V = {v^(1), ..., v^(m)} and associated labels {y^(1), ..., y^(m)}. The goal is to learn a model p(y | v) so that new labels can be predicted from new unlabeled examples v.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).
The idea behind unsupervised feature discovery is that the final learning problem can become much easier if the problem is represented in the right way. By learning the structure of V we can discover a feature mapping φ(v) that can be used to preprocess the data prior to running a standard supervised learning algorithm, such as an SVM.

There has been a great deal of recent interest in investigating different unsupervised learning schemes to train φ from V. In particular, the goal of deep learning (Bengio, 2009) is to learn a function φ that consists of many layers of processing, each of which receives the output of the previous layer as input and incrementally disentangles the factors of variation in the data. Deep learning systems are usually created by composing together several shallow unsupervised feature learners. Examples of shallow models applied to feature discovery include sparse coding (Raina et al., 2007), restricted Boltzmann machines (RBMs) (Hinton et al., 2006; Courville et al., 2011b), various autoencoder-based models (Bengio et al., 2007), and hybrids of autoencoders and sparse coding (Kavukcuoglu et al., 2010). In the context of probabilistic generative models, such as the RBM, φ(v) is typically taken to be the conditional expectation of the latent variables, and the process of learning φ consists simply of fitting the generative model to V.

Single-layer convolutional models based on simple feature extractors currently achieve state-of-the-art performance on the CIFAR-10 object recognition dataset (Coates and Ng, 2011; Jia and Huang, 2011). However, the best-performing feature extractors for the detection layer of the convolutional model are known to degrade when fewer labeled examples are available (Coates and Ng, 2011). In particular, sparse coding outperforms a simple thresholded linear feature extractor as the number of labeled examples decreases.
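The pipeline just described, learning a feature map φ and applying it before a standard classifier, can be illustrated with a small sketch. As a stand-in for a learned φ, we use the simple thresholded linear extractor mentioned above, φ(v) = max(0, W⊤v − t); the dictionary, shapes, and threshold here are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sketch of the feature-extraction pipeline: map each input v
# to features phi(v), then hand the features to a supervised learner such
# as an SVM. The "dictionary" W below is random rather than learned, purely
# for illustration.

rng = np.random.default_rng(0)

n_inputs, n_latent, n_examples = 64, 128, 10
W = rng.standard_normal((n_inputs, n_latent))  # stand-in for a learned dictionary
t = 0.5                                        # threshold hyperparameter (assumed)

def phi(V, W, t):
    """Thresholded linear feature map: max(0, W^T v - t), row-wise over V."""
    return np.maximum(0.0, V @ W - t)

V = rng.standard_normal((n_examples, n_inputs))  # unlabeled examples (rows)
features = phi(V, W, t)

# The representation is nonnegative and sparse; it would replace V as the
# input to the supervised learning algorithm.
print(features.shape)          # (10, 128)
print((features >= 0).all())   # True
```

In practice W would come from an unsupervised learner (sparse coding, an RBM, or the S3C model of this paper), and φ(v) would be that model's conditional expectation of the latent variables rather than a fixed threshold.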
Our objective is to further improve performance when the number of labeled examples is low by introducing a new feature extraction procedure based on spike-and-slab sparse