FINDING OBJECTS IN IMAGE DATABASES BY GROUPING

J. Malik, D.A. Forsyth, M.M. Fleck†, H. Greenspan‡, T. Leung, C. Carson, S. Belongie & C. Bregler

Computer Science Division, University of California at Berkeley, Berkeley CA 94720

ABSTRACT

Retrieving images from very large collections, using image content as a key, is becoming an important problem. Finding objects in image databases is a big challenge in the field. This paper describes our approach to object recognition, which is distinguished by: a rich involvement of early visual primitives, including color and texture; hierarchical grouping and learning strategies in the classification process; the ability to deal with rather general objects in uncontrolled configurations and contexts. We illustrate these properties with three case-studies: one demonstrating the use of color and texture descriptors; one learning scenery concepts using grouped features; and one demonstrating a possible application domain in detecting naked people in a scene.

1. INTRODUCTION

Very large collections of images are becoming common, and users have a clear preference for accessing images in these databases based on the objects that are present in them. Creating indices for these collections by hand is unlikely to be successful, because these databases can be gigantic. Furthermore, it can be very difficult to impose order on these collections. For example, the California Department of Water Resources (DWR) collection contains of the order of half-a-million images; a subset of this collection can be searched at http://elib.cs.berkeley.edu. Another example is the collection of images available on the Internet, which is notoriously large and disorderly.

Classical object recognition techniques from computer vision cannot help with this problem.
Recent techniques can identify specific objects drawn from a small (of the order of 100) collection, but no present technique is effective at the general classification task. In this short paper we will not attempt to cover all the related literature in the field (e.g. [1,2]). For a complete reference and comparison among the current systems in handling image databases, please refer to [3].

This paper presents case studies illustrating an approach to determining image content that is capable of object classification. Our approach is to construct a sequence of successively abstract descriptors, at an increasingly high level, through a hierarchy of grouping and learning processes. At the lowest level, grouping is based on spatiotemporal coherence of local image descriptors (color, texture, disparity, motion), with contours and junctions extracted simultaneously to organize these groupings. At the next stage, the assumptions that need to be invoked are more global (in terms of size of image region) as well as more class-specific. Slogans characterizing this approach are: grouping proceeds from the local to the global; and grouping proceeds from invoking generic assumptions to more specific ones.

†Department of Computer Science, University of Iowa, Iowa City, IA 52240
‡Also with the Dept. of Electrical Engineering, CALTECH, Pasadena CA 91125
0-7803-3258-X/96/$5.00 © 1996 IEEE

We see three major issues:

1. Segmenting images into coherent regions based on integrated region and contour descriptors: An important stage in identifying objects is deciding which image regions come from particular objects. This is simple when objects are made of stuff of a single, fixed color; however, most objects are covered with textured stuff, where the spatial relationships between colored patches are important.
The content-based retrieval literature contains a wide variety of examples of the usefulness of quite simple descriptions in describing images and objects. Color histograms are a particularly popular example; however, color histograms lack spatial cues, and so must confuse, for example, the English and the French flags. In what follows (section 2), we show three important cases: in the first, features extracted from the orientation histogram of the image are used for the extraction of coherent texture regions. In the second, the observation that a region of stuff is due to the periodic repetition of a simple tile yields information about the original tile and the repetition process. Finally, measurements of the size and number of small blobs of color yield information about stuff regions (such as fields of flowers) that cannot be obtained from color histograms alone.

2. Learning as a methodology for developing the relationship between object classes and color, texture and shape descriptors: Given the color, texture and shape descriptors for a set of labeled objects, one can use machine learning techniques to train a classifier. In section 3, we show results obtained using a decision tree classifier that was trained to distinguish among a number of visual concepts that are common in our image database. A novel aspect of this work is the use of grouping as part of the process of constructing the descriptors, instead of using simple pixel-level feature vectors. Interestingly, the output of a classifier can itself be used to guide higher level grouping. While this work is preliminary, it does suggest a way to make the processes of acquiring object models and developing class-based grouping strategies less tedious.

3. Classifying objects based on primitive descriptions and relationships between primitives: Once regions have been described as primitives, the relationships between primitives become important.
Finding people or animals in images is essentially a process of finding regions corresponding to segments and then assembling those segments into limbs and girdles. This process involves explor-