A Graph-Matching Kernel for Object Categorization

Olivier Duchenne¹,²,³  Armand Joulin¹,²,³  Jean Ponce²,³
¹ INRIA  ² École Normale Supérieure de Paris

Abstract

This paper addresses the problem of category-level image classification. The underlying image model is a graph whose nodes correspond to a dense set of regions, and edges reflect the underlying grid structure of the image and act as springs to guarantee the geometric consistency of nearby regions during matching. A fast approximate algorithm for matching the graphs associated with two images is presented. This algorithm is used to construct a kernel appropriate for SVM-based image classification, and experiments with the Caltech 101, Caltech 256, and Scenes datasets demonstrate performance that matches or exceeds the state of the art for methods using a single type of features.

1. Introduction

Explicit correspondences between local image features are a key element of image retrieval [30] and specific object detection [29] technology, but they are seldom used [3, 13, 16, 35] in object categorization, where bags of features (BOFs) and their variants [4, 7, 8, 10, 26, 38, 39] have been dominant. However, as shown by Caputo and Jie [6], feature correspondences can be used to construct an image comparison kernel [35] that, although not positive definite, is appropriate for SVM-based classification, and often outperforms BOFs on standard datasets such as Caltech 101 in terms of classification rates. This is the first motivation for the approach to object categorization proposed in the rest of this presentation. Our second motivation is that image representations that enforce some degree of spatial consistency (such as HOG models [8], spatial pyramids [26], and their variants, e.g., [4, 38]) usually perform better in image classification tasks than pure bags of features that discard all spatial information.
This suggests adding spatial constraints to pure appearance-based matching and thus formulating object categorization as a graph matching problem where a unary potential is used to select matching features, and a binary one encourages nearby features in one image to match nearby features in the second one.

Concretely, we propose to represent images by graphs whose nodes and edges represent the regions associated with a coarse image grid and their adjacency relationships. The problem of matching two images is formulated as the optimization of an energy akin to a first-order multi-label Markov random field (MRF),⁴ defined on the corresponding graphs, the labels corresponding to node assignments. Variants of this formulation have been used in problems ranging from image restoration to stereo vision and object recognition. However, as shown by a recent comparison [23], its performance in image classification tasks has so far been a bit disappointing. As further argued in the next section, this may be due in part to the fact that current approaches are too slow to support the use of sophisticated classifiers such as support vector machines (SVMs). In contrast, this paper makes three original contributions:

1. Generalizing [6, 35] to graphs, we propose in Section 2 to use the value of the optimized MRF associated with two images as a (non-positive definite) kernel, suitable for SVM classification.

2. We propose in Section 3 a novel extension of Ishikawa's method [20] for optimizing the MRF which is orders of magnitude faster than competing algorithms (e.g., [23, 25, 27]) for the grids with a few hundred nodes considered in this paper. In turn, this allows us to combine our kernel with SVMs in image classification tasks.

3. We demonstrate in Section 4 through experiments with standard benchmarks (Caltech 101, Caltech 256, and Scenes datasets) that our method matches and in some cases exceeds the state of the art for methods using a single type of features.

³ WILLOW project-team, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548.
⁴ As is often the case in computer vision applications, our use of the MRF notion here is slightly abusive since our formulation does not require or assume any probabilistic modeling.

1.1. Related work

Early "appearance-based" approaches to image retrieval and object recognition, such as color histograms, eigenfaces or appearance manifolds, used global image descriptors to match images. Schmid and Mohr [30] proposed instead