J. A. Black, Jr., M. Phielipp, G. Nielson, and S. Panchanathan, "Can the high-level content of natural images be indexed using local analysis?," presented at the Human Vision and Electronic Imaging Conference (HVEI 2004), San Jose, CA, 2004.

Can the high-level content of natural images be indexed using local analysis?

John A. Black, Jr*, Mariano Phielipp, Greg Nielson, Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing, Arizona State University, Tempe, AZ 85287

ABSTRACT

Early methods of image indexing relied heavily on color histograms, which characterize the global content of images. However, global indexing methods proved to be unsatisfactory, and researchers now employ more localized measures of image content, based on relatively small regions. At the same time, it has also become clear that image indexing should be based on higher-level visual content. This raises an important question: “Can the higher-level content of images be reliably indexed using local analysis?” In general, humans are better at indexing mid-level and high-level visual content than today’s automated indexing algorithms. Therefore, it makes sense to ascertain how well humans can perform mid-level or high-level indexing based on small regions. This paper describes research that employs a set of outdoor scenery images (called the NaturePix image set) to compare how successfully humans can label the visual content of small regions of natural images when (1) these regions are seen in the context of the larger image, and (2) these regions are extracted from (and are seen in isolation from) that larger image. The results of these experiments indicate what types of higher-level image content can be recognized locally, and how successfully high-level image content can be indexed on the basis of local feature analysis.

Keywords: Content based image retrieval, Image indexing, Semantic indexing, Lexical basis functions, Image content, Visual content, Semantic content, NaturePix, Local content analysis, Feature detectors

1. INTRODUCTION

Query-by-example image retrieval uses feature vectors to represent the visual content of each image. The feature vector of the query image is then compared to the feature vectors of the images in the database to find the best matches. Historically, these feature vectors have been based on low-level content, such as color or spatial frequency, and the resulting retrievals have been less than satisfactory. Since humans tend to perceive image content at the semantic level, many content-based image retrieval researchers have focused on how to extract semantic content from images for indexing purposes. However, to date there has been no universal agreement on what is meant by the term semantic content, or how it can be extracted from images. This paper presents a taxonomy for the various levels of content, describes how visual content words might be used as a basis for classifying low-level semantic content, and describes a method for identifying the types of low-level semantic content that might be amenable to the design of automatic detectors that could index images based on their semantic content.

2. BACKGROUND AND RELATED WORK

Early image retrieval systems were essentially databases that relied on manually attached textual annotations.
Some of these textual annotations were unrelated to the visual content of the images (providing information such as the location and/or the date at which the image was captured), while other annotations provided information about the visual content, albeit typically using a set of non-standardized keywords. This approach was able to provide image retrieval based on high-level content, such as objects, people, or emotions depicted in the image. However, manual annotation is a tedious process, and the quality of the retrieval system is highly dependent upon the quality and consistency of the manual annotations. To provide a better solution to the retrieval problem, researchers have attempted to develop methods for automatically indexing images to permit content-based image retrieval.

2.1 Indexing images based on global analysis

Some of the earliest efforts employed global histograms to index the content of images, and this approach is still being researched and refined. For example, Pass et al. [1] proposed a refined histogramming method that partitions each histogram bucket based on spatial coherence. Brunelli et al. [2] evaluated various similarity measures for color and luminance histograms in an effort to find the “best” measure, and determined that a high bin count does not significantly enhance histogram similarity measures. Nephade et al. [3] used a variety of non-local features, including color histograms, to characterize the content of video images, and Siggelkow [4] developed a method for histogram anti-aliasing that assigns each pixel's contribution, with appropriate weights, to several neighboring bins.
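To make the global-analysis approach concrete, the following Python sketch computes an anti-aliased (soft-binned) global color histogram in the spirit of Siggelkow's weighted bin assignment, and ranks database images against a query histogram as in query-by-example retrieval. It is a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names, the choice of 8 bins per channel, and the L1 (city-block) distance are assumptions made here for brevity.

    import numpy as np

    def soft_color_histogram(image, bins_per_channel=8):
        """Global color histogram with anti-aliased (soft) bin assignment.

        Each pixel's value on each RGB channel is shared between its two
        nearest bins, weighted by distance to the bin centers, rather than
        being counted in a single bin. (Illustrative sketch, not the
        cited authors' implementation.)
        """
        # image: H x W x 3 array of RGB values in the range [0, 255]
        hist = np.zeros((bins_per_channel,) * 3)
        bin_width = 256.0 / bins_per_channel
        # Fractional bin coordinates: a value at a bin center maps to an integer.
        coords = image.reshape(-1, 3).astype(float) / bin_width - 0.5
        lower = np.floor(coords).astype(int)
        frac = coords - lower
        for lo, f in zip(lower, frac):
            # Spread the pixel's unit weight over the 2 x 2 x 2 neighborhood of bins.
            for corner in np.ndindex(2, 2, 2):
                idx = np.clip(lo + np.array(corner), 0, bins_per_channel - 1)
                weight = np.prod(np.where(corner, f, 1.0 - f))
                hist[tuple(idx)] += weight
        return hist.ravel() / hist.sum()  # normalize to unit mass

    def retrieve(query_hist, database_hists, top_k=5):
        """Rank database histograms by L1 (city-block) distance to the query."""
        distances = [np.abs(query_hist - h).sum() for h in database_hists]
        return np.argsort(distances)[:top_k]

    # Example usage with random stand-in images:
    # rng = np.random.default_rng(0)
    # images = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(10)]
    # hists = [soft_color_histogram(im) for im in images]
    # print(retrieve(soft_color_histogram(images[0]), hists, top_k=3))

Other histogram comparison functions, such as histogram intersection or a chi-squared statistic, could be substituted for the L1 distance in retrieve(); comparing the behavior of such measures is essentially the kind of evaluation reported by Brunelli et al. [2].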