Image Retrieval: Feature Primitives, Feature Representation, and Relevance Feedback

Xiang Sean Zhou, Thomas S. Huang
Beckman Institute for Advanced Science and Technology
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{xzhou2, huang}@ifp.uiuc.edu

Abstract

In this paper we review the feature selection and representation techniques used in CBIR systems and propose a unified feature representation paradigm. We revise our previously proposed water-filling edge features with newly proposed primitives and present them under this unified feature formation paradigm. Multi-scale feature formation is proposed to support cross-resolution image matching. Sub-image feature extraction is applied for regional matching. Relevance feedback, as an on-line learning mechanism, is adopted for feature and tile selection and weighting during retrieval. We discuss in detail the revised water-filling edge features, cross-scale feature extraction and image matching, and relevance feedback on regional/tile-based matching.

1. Introduction

The performance of a CBIR system is inherently constrained by the features adopted to represent the images in the database. The most frequently cited “visual contents” are color, texture, and shape[2]. If we regard the information embedded in a digital image as chrominance information combined with illuminance information, then the color feature captures the chrominance information, while both texture and shape represent the illuminance information. Texture features (e.g., co-occurrence features[4] and wavelet-based features[8]) and shape features (e.g., Fourier descriptors[15] and moment invariants[5]) have been applied extensively in image retrieval systems. Even though they lack a formal definition, “texture features” can be described as features that capture the spatial distribution of illuminance variations in terms of “repeating patterns”, whereas shape is a specific edge feature related to the object contour.
However, in most real-world applications, shape features are not applicable, since a meaningful segmentation is unachievable. Various other features have also been tried, such as edge densities, edge directions, turning angles, co-linearity, and salient points, in general or domain-specific applications. Even though there have been efforts to classify these features into the “shape” category, this is clearly inappropriate; for example, it is hard to relate the concept of “edge density” to the concept of “shape”. We would therefore rather regard “shape” as a special component of a much broader category of “structural features”, which capture information represented in non-repeating illuminance patterns, or more specifically, in edge patterns in general[16]. Under this categorization scheme, one can think of “structural features” as features capturing the edge information in the image, while “shape” is just a special case capturing only the (outer?) edge of some object(s) in the image.

As we try to answer the question of “how to represent the information embedded in the edges” instead of “how to represent the information of the object shape”, we are left with much more freedom and flexibility in terms of feature formation. In this paper we use a unified feature representation paradigm to illustrate the process of feature formation for images, and apply it to guide the formation of our proposed water-filling features. The detailed discussion is presented in Sections 2 and 3. As we include cross-scale feature extraction in the general feature formation paradigm, cross-scale image matching becomes possible. Section 4 presents the results on cross-scale matching.
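A minimal example of a structural, non-shape feature of the kind listed above is edge density: detect edge points, then report the fraction of pixels that are edges. The simple gradient test and threshold below are illustrative assumptions; this is not the water-filling algorithm, which is described later in the paper.

```python
# Illustrative sketch: edge density as a segmentation-free structural
# feature. A pixel is marked as an edge when its forward intensity
# differences exceed a threshold (a simplistic stand-in for a real
# edge detector).

def edge_map(img, thresh):
    """Binary edge map from horizontal/vertical intensity differences."""
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][x + 1] - img[y][x] if x + 1 < w else 0
            gy = img[y + 1][x] - img[y][x] if y + 1 < h else 0
            if abs(gx) + abs(gy) >= thresh:
                edges[y][x] = 1
    return edges

def edge_density(edges):
    """Fraction of pixels marked as edge points."""
    return sum(map(sum, edges)) / (len(edges) * len(edges[0]))

img = [[0, 0, 9, 9]] * 4          # a single vertical step edge
density = edge_density(edge_map(img, thresh=5))
```

Such a scalar says nothing about object contours, which is exactly why it belongs to the broader “structural” category rather than to “shape”.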
If we allow either rough region segmentation or straightforward tiling of the image, regional/tile-based features can be formed together with global image features, and an on-line learning scheme based on user relevance feedback can be applied to automatically determine the relative importance of the tile-based features versus the global features. An experiment is described in Section 5.

2. Unified Feature Representation Paradigm

Most existing image features can be regarded as constructed in the following two steps:

1. Feature primitives are selected from the original image to retain the useful information embedded in it. A transformation may be applied prior to primitive extraction.
2. A compact representation is chosen to capture the information carried by the feature primitives. Most such representations are statistical in nature.

Table I shows some examples of feature formation, including color moments, the color histogram, wavelet moments for texture, the co-occurrence matrix for texture, and water-filling features for structure. Initially, information is carried in the pixel values, both color and intensity. These can serve as feature primitives themselves. Alternatively, feature primitives can be the output of some lossless or lossy transformation, such as the DCT, the wavelet transform, or edge detection followed by the water-filling operation.
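The two-step paradigm above can be sketched with color moments as the compact representation: the primitives are the raw channel values, and the representation is a few statistical moments per channel. The channel layout and the choice of the first two moments are illustrative assumptions.

```python
# Sketch of the two-step feature-formation paradigm:
# step 1 -- feature primitives (here, raw per-channel pixel values);
# step 2 -- a compact statistical representation (here, mean and
# standard deviation, i.e. the first two color moments per channel).

from statistics import mean, pstdev

def color_moments(channels):
    """Mean and std. dev. per channel -> compact feature vector."""
    feats = []
    for values in channels:            # step 1: primitives = pixel values
        feats.append(mean(values))     # step 2: first moment
        feats.append(pstdev(values))   #         second moment
    return feats

feats = color_moments([[1, 3], [2, 2]])   # two tiny channels
```

Swapping the primitives (e.g., wavelet coefficients instead of pixel values) or the representation (e.g., a histogram instead of moments) yields the other feature families listed in Table I, which is the point of treating the two steps as independent choices.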