Object-Specific Features for Real World Object Recognition in Biological Vision

Thomas Serre, Jennifer Louie & Maximilian Riesenhuber
Artificial Intelligence Laboratory and The Center for Biological and Computational Learning
Massachusetts Institute of Technology, Cambridge, Massachusetts 02139
http://www.ai.mit.edu @ MIT

The Problem: We propose a biologically plausible model of object recognition in cortex that is able to handle a real-world face detection task at the level of state-of-the-art machine vision systems.

Motivation: Models of object recognition in cortex have so far been applied mostly to tasks involving the recognition of isolated objects presented on blank backgrounds. Ultimately, however, models of the visual system have to prove themselves on real-world object recognition tasks, such as face detection in cluttered scenes, a standard computer vision benchmark. Recent advances in machine vision have shown the benefit of image representations based on target object class-specific features for such tasks [1]. Here we wish to explore the learning of object class-specific features at intermediate stages of our recently presented model of object recognition in cortex [2], and to test its performance on a face detection task.

Previous work: We propose an extension of the HMAX model of object recognition in cortex [2]. The model, consisting of a hierarchy of layers with two different types of pooling mechanisms (linear operations, “S”, to build more complex features from simpler ones, and nonlinear MAX pooling operations, “C”, to increase the invariance of units to stimulus scaling and translation; see figure), is a model of the ventral visual pathway in cortex, extending from primary visual cortex, V1, to inferotemporal cortex, IT, a brain area thought to be crucial for object recognition.
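The two pooling mechanisms can be sketched as follows (a minimal toy illustration, not the full model: the afferent vectors are 1-D stand-ins, and the function names are our own):

```python
import numpy as np

def s_pool(afferents, weights):
    # "S" layer: linear combination of afferent responses,
    # building a more complex feature from simpler ones
    return float(np.dot(weights, afferents))

def c_pool(afferent_responses):
    # "C" layer: MAX over afferents tuned to the same feature at
    # different positions/scales, giving invariance to translation
    # and scaling -- the strongest response wins, wherever it occurred
    return float(np.max(afferent_responses))

# Toy example: the same feature detected at three positions
responses = np.array([0.2, 0.9, 0.4])
print(c_pool(responses))  # 0.9, regardless of where the feature appeared
```

The MAX operation is what distinguishes HMAX from purely linear pooling schemes: the "C" unit inherits the selectivity of its strongest afferent while discarding its position and scale.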
The model explains how so-called view-tuned neurons in IT can exhibit highly specific tuning to views of complex objects while showing invariance to changes in stimulus position and scale. In the model, object-specific learning has so far occurred only at the higher levels. We found that the model performed rather poorly on a face detection task, owing to the low specificity of the hardwired feature set of C2 units (corresponding to neurons in intermediate visual area V4), which show no particular tuning for faces vs. background. We extended the previous model and showed how visual features of intermediate complexity can be learned in HMAX using a simple learning rule.

Approach: Patterns on the model “retina” (100 × 100 grayscale images) are first filtered through a continuous layer S1 of overlapping simple-cell-like receptive fields (first derivatives of Gaussians) at different scales and orientations. Neighboring S1 cells are in turn pooled by C1 cells through a MAX operation. The difference from standard HMAX lies in the C1→S2 connectivity: whereas in standard HMAX these connections are hardwired to produce 256 2 × 2 combinations of C1 inputs, they are now learned from the data. Given a patch size p, a feature corresponds to a p × p × 4 pattern of C1 activation w, where the 4 comes from the four preferred orientations of the C1 units. S2 units are RBF-like units centered on prototypes u obtained by performing vector quantization (VQ, using the k-means algorithm) over patterns of C1 activation extracted at random positions from face images. On top of the system, C2 cells perform a MAX operation over the whole visual field and provide the final encoding of the stimulus, which constitutes the input to an SVM classifier.

Difficulty: Feature learning in a hierarchy is a difficult computational problem, and so is face detection in natural images.
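The feature-learning and encoding pipeline of the Approach can be sketched as follows (a simplified illustration only: random vectors stand in for real C1 activations, and the patch size, prototype count, σ, and function names are our own choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(data, k, iters=20):
    # Vector quantization: learn k prototype vectors u from sampled patches
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # assign each patch to its nearest prototype, then recompute means
        labels = np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def s2_response(patch, u, sigma=1.0):
    # RBF-like S2 unit centered on learned prototype u
    return np.exp(-np.sum((patch - u) ** 2) / (2 * sigma ** 2))

# Stand-ins for p x p x 4 C1 patches (4 preferred orientations), flattened
p = 2
face_patches = rng.random((500, p * p * 4))   # patches sampled from face images
prototypes = kmeans(face_patches, k=10)       # learned S2 features

# C2 encoding of one stimulus: MAX of each S2 unit over all positions
image_patches = rng.random((50, p * p * 4))   # all patches from one image
c2 = np.array([max(s2_response(pt, u) for pt in image_patches)
               for u in prototypes])          # feature vector fed to the SVM
print(c2.shape)  # (10,)
```

Each image is thus reduced to one activation per learned prototype, and the global MAX makes the C2 code invariant to where in the visual field each feature was matched; the SVM operates on these fixed-length vectors.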
Impact: Using a simple rule to learn object-specific features, HMAX outperforms a classical machine vision face detection system from the literature. This suggests an important role for the feature set in intermediate visual areas in object recognition. Moreover, features are not chosen according to their discriminative power for any particular classification task (between-class discrimination) but rather for their within-class representativeness. We expect the same features to be used for other recognition tasks; however, their weight in the decision might vary from one task to another.