A Graph-Matching Kernel for Object Categorization

Olivier Duchenne¹,²,³   Armand Joulin¹,²,³   Jean Ponce²,³
¹ INRIA   ² École Normale Supérieure de Paris

Abstract

This paper addresses the problem of category-level image classification. The underlying image model is a graph whose nodes correspond to a dense set of regions, and edges reflect the underlying grid structure of the image and act as springs to guarantee the geometric consistency of nearby regions during matching. A fast approximate algorithm for matching the graphs associated with two images is presented. This algorithm is used to construct a kernel appropriate for SVM-based image classification, and experiments with the Caltech 101, Caltech 256, and Scenes datasets demonstrate performance that matches or exceeds the state of the art for methods using a single type of features.

1. Introduction

Explicit correspondences between local image features are a key element of image retrieval [30] and specific object detection [29] technology, but they are seldom used [3, 13, 16, 35] in object categorization, where bags of features (BOFs) and their variants [4, 7, 8, 10, 26, 38, 39] have been dominant. However, as shown by Caputo and Jie [6], feature correspondences can be used to construct an image comparison kernel [35] that, although not positive definite, is appropriate for SVM-based classification, and often outperforms BOFs on standard datasets such as Caltech 101 in terms of classification rates. This is the first motivation for the approach to object categorization proposed in the rest of this presentation. Our second motivation is that image representations that enforce some degree of spatial consistency (such as HOG models [8], spatial pyramids [26], and their variants, e.g. [4, 38]) usually perform better in image classification tasks than pure bags of features that discard all spatial information.
This suggests adding spatial constraints to pure appearance-based matching and thus formulating object categorization as a graph matching problem where a unary potential is used to select matching features, and a binary one encourages nearby features in one image to match nearby features in the second one.

Concretely, we propose to represent images by graphs whose nodes and edges represent the regions associated with a coarse image grid and their adjacency relationships. The problem of matching two images is formulated as the optimization of an energy akin to a first-order multi-label Markov random field (MRF),⁴ defined on the corresponding graphs, the labels corresponding to node assignments. Variants of this formulation have been used in problems ranging from image restoration, to stereo vision, and object recognition. However, as shown by a recent comparison [23], its performance in image classification tasks has been, so far, a bit disappointing. As further argued in the next section, this may be due in part to the fact that current approaches are too slow to support the use of sophisticated classifiers such as support vector machines (SVMs). In contrast, this paper makes three original contributions:

1. Generalizing [6, 35] to graphs, we propose in Section 2 to use the value of the optimized MRF associated with two images as a (non positive definite) kernel, suitable for SVM classification.

2. We propose in Section 3 a novel extension of Ishikawa's method [20] for optimizing the MRF which is orders of magnitude faster than competing algorithms (e.g., [23, 25, 27] for the grids with a few hundred nodes considered in this paper). In turn, this allows us to combine our kernel with SVMs in image classification tasks.

3. We demonstrate in Section 4 through experiments with standard benchmarks (Caltech 101, Caltech 256, and Scenes datasets) that our method matches and in some cases exceeds the state of the art for methods using a single type of features.

³ WILLOW project-team, Laboratoire d'Informatique de l'École Normale Supérieure, ENS/INRIA/CNRS UMR 8548.

⁴ As is often the case in computer vision applications, our use of the MRF notion here is slightly abusive since our formulation does not require or assume any probabilistic modeling.

1.1. Related work

Early “appearance-based” approaches to image retrieval and object recognition, such as color histograms, eigenfaces or appearance manifolds, used global image descriptors to match images. Schmid and Mohr [30] proposed instead