Object-specific Features, Biological Vision and Real World Object Recognition Thomas Serre, Jennifer Louie, Maximilian Riesenhuber [CBCL, with Prof. T. Poggio]

The Problem: We propose a biologically plausible model of object recognition in cortex that handles a real-world face detection task at the level of state-of-the-art machine vision systems.

Motivation: Models of object recognition in cortex have mostly been applied to tasks involving the recognition of isolated objects presented on blank backgrounds. Ultimately, models of the visual system have to prove themselves on real-world object recognition tasks, such as face detection in cluttered scenes, a standard computer vision benchmark. For such tasks, recent advances in machine vision have shown the benefit of image representations based on features specific to the target object class [1]. Here we explore the learning of object class-specific features in intermediate stages of a recently presented model of object recognition in cortex [3], and test its performance on a face detection task.

Previous Work: We propose an extension of the HMAX model of object recognition in cortex [3], which characterizes the ventral visual pathway, extending from primary visual cortex (V1) to inferotemporal cortex (IT), a brain area thought to be crucial for object recognition. The model consists of a hierarchy of layers with two different types of pooling mechanisms: linear "S" operations, which build more complex features from simple ones, and nonlinear MAX pooling "C" operations, which increase the invariance of units to stimulus scaling and translation (see figure). The model explains how so-called view-tuned neurons in IT can exhibit highly specific tuning to views of complex objects while showing invariance to changes in stimulus position and scale. In the model, object-specific learning so far occurs only in the higher levels.
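The two pooling operations that alternate through the hierarchy can be sketched in a few lines of NumPy. This is a minimal illustration of the "S" (linear template match) and "C" (nonlinear MAX) stages described above, not the authors' implementation; the function names and the simple non-overlapping pooling are our own simplifying assumptions.

```python
import numpy as np

def s_pool(inputs, templates):
    # "S" stage (linear): each unit responds in proportion to how well
    # its afferent inputs match a stored template (here, a dot product),
    # building more complex features from simple ones.
    return templates @ inputs

def c_pool(responses, pool_size):
    # "C" stage (nonlinear): each unit takes the MAX over a pool of
    # S units tuned to the same feature at neighboring positions or
    # scales, yielding invariance to translation and scaling within
    # the pool (a shift inside the pool leaves the output unchanged).
    n = len(responses) // pool_size
    return np.array([responses[i * pool_size:(i + 1) * pool_size].max()
                     for i in range(n)])
```

Note how the MAX makes the C output invariant to where, within its pool, the strongest S response occurs: `c_pool(np.array([5.0, 1.0, 0.0, 2.0]), 2)` and `c_pool(np.array([1.0, 5.0, 2.0, 0.0]), 2)` both give `[5.0, 2.0]`.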
We found that the model performed rather poorly on a face detection task, due to the low specificity of the hardwired feature set of C2 units in the model (corresponding to neurons in intermediate visual area V4), which show no particular tuning for faces vs. background. We extended the previous model and showed how visual features of intermediate complexity can be learned in HMAX using a simple learning rule [4].

Approach: Input images are first filtered through a continuous layer S1 of overlapping simple-cell-like receptive fields (first derivatives of Gaussians) at different scales and orientations. Neighboring S1 cells are in turn pooled by C1 cells through a MAX operation. The difference from standard HMAX lies in the C1-to-S2 connectivity: while in standard HMAX these connections are hardwired to produce 256 combinations of C1 inputs, they are now learned from the data. S2 units are RBF-like units centered on features obtained by performing vector quantization (VQ, using the k-means algorithm) over patterns of C1 activation extracted at random positions over face images. Given a patch size p, a feature corresponds to a p × p × 4 pattern of C1 activation, where the factor of four comes from the four different preferred orientations of the C1 units. At the top of the system, C2 cells perform a MAX operation over the whole visual field and provide the final encoding of the stimulus, which constitutes the input to an SVM classifier. Extensive comparisons with computer vision systems [1, 5] have shown that HMAX with feature learning handles a face detection task at the level of state-of-the-art classifiers [4]. We also showed that a simple, biologically plausible feature selection technique (keeping the maximally activated S2 units) allows unsupervised feature learning from both face and non-face parts while maintaining a high level of performance [2].
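The feature-learning pipeline above (k-means VQ over random C1 patches, then RBF-tuned S2 units pooled by a global MAX at C2) can be sketched as follows. This is a hedged NumPy sketch under our own assumptions: function names, the plain Lloyd's-algorithm k-means, and the RBF width `sigma` are illustrative choices, not the authors' code, and the C1 maps are assumed to be arrays of shape (H, W, 4) for the four preferred orientations.

```python
import numpy as np

def learn_s2_features(c1_maps, patch_size, n_features,
                      n_samples=500, n_iters=20, seed=0):
    # Vector quantization over patches of C1 activation: sample
    # p x p x 4 patches at random positions over the training images,
    # then run plain k-means; the cluster centers become the S2 features.
    rng = np.random.default_rng(seed)
    n, H, W, _ = c1_maps.shape
    patches = []
    for _ in range(n_samples):
        i = rng.integers(n)
        y = rng.integers(H - patch_size + 1)
        x = rng.integers(W - patch_size + 1)
        patches.append(c1_maps[i, y:y + patch_size, x:x + patch_size].ravel())
    X = np.array(patches)
    centers = X[rng.choice(len(X), n_features, replace=False)]
    for _ in range(n_iters):  # Lloyd's algorithm
        dists = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(n_features):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return centers  # shape (n_features, patch_size * patch_size * 4)

def c2_encoding(c1_map, centers, patch_size, sigma=1.0):
    # Each S2 unit is an RBF centered on a learned feature; each C2
    # unit takes the MAX of its S2 responses over the whole visual
    # field. The resulting vector is the input to the SVM classifier.
    H, W, _ = c1_map.shape
    best = np.full(len(centers), -np.inf)
    for y in range(H - patch_size + 1):
        for x in range(W - patch_size + 1):
            patch = c1_map[y:y + patch_size, x:x + patch_size].ravel()
            resp = np.exp(-((centers - patch) ** 2).sum(1) / (2 * sigma ** 2))
            best = np.maximum(best, resp)
    return best
```

The output of `c2_encoding` is one value per learned feature, regardless of image size, which is what allows a standard classifier such as an SVM to be trained on top.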
Impact: Feature learning in a hierarchy is a difficult computational problem, and so is face detection in natural images. Using a simple rule to learn object-specific features, HMAX performs at the level of classical machine vision face detection systems presented in the literature. This suggests an important role for the feature set in intermediate visual areas in object recognition. Moreover, features are not chosen according to their discriminative power for any particular classification task (between-class discrimination) but rather for their within-class representativeness. We expect the same features to be used for other recognition tasks; however, their weight in the decision may vary from one task to another.

Future Work: We plan to examine the model's performance with respect to non-affine transformations such as rotation in depth and illumination changes. Future work will also include a comparison with humans on a face detection task and an extension to other object classes.