Boosting Contextual Information in Content-Based Image Retrieval

Jaume Amores 1, Nicu Sebe 2, Petia Radeva 1, Theo Gevers 2, Arnold Smeulders 2
1 Computer Vision Center, UAB, Spain {jaume, petia}@cvc.uab.es
2 Univ. of Amsterdam, The Netherlands {nicu, gevers, smeulders}@science.uva.nl

ABSTRACT

We present a new framework for characterizing and retrieving objects in cluttered scenes. This CBIR system is based on a new representation that describes every object by the local properties of its parts and their mutual spatial relations, without relying on accurate segmentation. For this purpose, a new multi-dimensional histogram is used that measures the joint distribution of local properties and relative spatial positions. Instead of using a single descriptor for the whole image, we represent the image by a set of histograms covering the object from different perspectives. We integrate this representation into a framework with two stages. The first stage allows efficient retrieval based on the geometric properties (shape) of objects in images with clutter. This is achieved by i) using a contextual descriptor that incorporates the distribution of local structures, and ii) using a distance that disregards the clutter in the images. At the second stage, we introduce a more discriminative descriptor that characterizes the parts of the objects by their color and their local structure. By using relevance feedback and boosting as a feature-selection algorithm, the system is able to learn simultaneously the information that characterizes each part of the object along with the parts' mutual spatial relations. Results are reported on two well-known databases and are quantitatively compared to other successful approaches.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms: Algorithms.
Keywords: Content-Based Image Retrieval, Object Recognition, Contextual Information, Boosting

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MIR’04, October 15–16, 2004, New York, New York, USA. Copyright 2004 ACM 1-58113-940-3/04/0010 ...$5.00.

1. INTRODUCTION

Given the large amount of information available in the form of digital images, it becomes critical to develop systems that automatically organize and retrieve images based on their content. In this work, we present an object-based retrieval approach. An important difference from a general object-class recognition approach is that achieving fast response times is a major goal when the aim is retrieval (i.e., interaction with the user).

We regard an object as a collection of parts and their mutual spatial relations. In this sense, the representation of the image must take into account local information characterizing the parts, and contextual information characterizing the context of each part (how the rest of the parts are spatially related to it). In object retrieval, many authors rely on a segmentation of the image into blobs [3, 4, 18]. Local information is obtained by extracting a set of descriptors from each blob, which, with current segmentation techniques, very often does not represent a whole object. A classical contextual representation is the “Attributed Relational Graph” (ARG) [9, 12]. This descriptor represents the parts as nodes and their spatial relationships as arcs. If we obtain the parts by segmenting the image into blobs, this descriptor is not appropriate for complex images.
The reason is that, with state-of-the-art segmentation, the number of blobs and their spatial distribution are not constant across different images of the same object.

Instead of using a discrete descriptor (e.g., an ARG), many authors take into account local properties along with their spatial relations through multi-dimensional histograms [11, 13, 19, 7]. These authors use generalizations of the color histogram that take into account the color of the pixels and their spatial relations. The approaches differ in the spatial relations they use: some restrict the relations to pixels of the same color, and some use all the pixels while others use only edge pixels. What these approaches have in common is that they use a single histogram for the whole image. The context around every part of the image is then mixed up and blurred by aggregating the relationships into one final spatial histogram. This makes it impossible to represent the different points of view of the object, i.e., how the context looks around different parts. Moreover, with this approach the background is aggregated into the descriptor, making it unreliable for cluttered scenes. Furthermore, using color as the only local information restricts the search to objects that have the same color as the query. This prevents general object-class recognition because, for example, black cars cannot be matched with red cars.

Belongie et al. [1] consider the use of several spatial histograms for describing the object. They developed a de-
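To make the general idea of per-part joint (property, spatial-relation) histograms concrete, the following Python sketch builds one small joint histogram of quantized local property and relative distance around each reference part. This is an illustrative simplification under stated assumptions, not the descriptor proposed in the paper: the function name, the uniform distance bins, and the toy data are all hypothetical.

```python
# Illustrative sketch: for each "part" (here, a sampled pixel), build a
# joint histogram over (quantized local property, relative distance) of the
# surrounding pixels. The output is a SET of histograms, one per part,
# rather than a single global histogram for the whole image.
import numpy as np

def part_context_histograms(labels, parts, n_labels, radius=8, n_dist_bins=4):
    """labels: 2-D array of quantized local properties (e.g. color indices).
    parts: list of (row, col) reference points.
    Returns one normalized (n_labels x n_dist_bins) histogram per part."""
    h, w = labels.shape
    hists = []
    for (r0, c0) in parts:
        hist = np.zeros((n_labels, n_dist_bins))
        for r in range(max(0, r0 - radius), min(h, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(w, c0 + radius + 1)):
                d = np.hypot(r - r0, c - c0)
                if 0 < d <= radius:
                    # uniform distance bins for simplicity; a log-polar
                    # layout would weight nearby context more finely
                    b = min(int(d / radius * n_dist_bins), n_dist_bins - 1)
                    hist[labels[r, c], b] += 1
        s = hist.sum()
        hists.append(hist / s if s > 0 else hist)
    return hists

# toy image with 2 quantized labels: a square "object" on background
img = np.zeros((16, 16), dtype=int)
img[4:12, 4:12] = 1
hs = part_context_histograms(img, [(8, 8), (0, 0)], n_labels=2)
print(hs[0].shape)  # (2, 4)
```

Because each part keeps its own histogram, a matching step can compare parts individually and ignore parts whose context is dominated by background, which is the robustness-to-clutter property the single-histogram approaches lack.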