Y. Zhuang et al. (Eds.): PCM 2006, LNCS 4261, pp. 835 – 843, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Compact Representation for Large-Scale Clustering
and Similarity Search
Bin Wang¹, Yuanhao Chen¹, Zhiwei Li², and Mingjing Li²
¹ University of Science and Technology of China
² Microsoft Research Asia
{binwang, yhchen04}@ustc.edu,
{zli, mjli}@microsoft.com
Abstract. Although content-based image retrieval has been researched for many
years, few content-based methods are implemented in present image search
engines. This is partly because of the great difficulty in indexing and searching
in high-dimensional feature space for large-scale image datasets. In this paper,
we propose a novel method to represent the content of each image as one or
multiple hash codes, which can be considered as special keywords. Based on this
compact representation, images can be accessed very quickly by their visual
content. Furthermore, two advanced functionalities are implemented. One is
content-based image clustering, which is simplified as grouping images with
identical or near identical hash codes. The other is content-based similarity
search, which is approximated by finding images with similar hash codes. The
hash code extraction process is very simple, and both image clustering and
similarity search can be performed in real time. Experiments on over 11 million
images collected from the web demonstrate the efficiency and effectiveness of
the proposed method.
Keywords: similarity search, image clustering, hash code.
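The two functionalities named in the abstract can be illustrated with a minimal sketch. It assumes each image already has a single hash code stored as an integer bit string; the extraction step is not shown. The function names, the use of Hamming distance as the measure of "similar" codes, and the sample file names and codes are all hypothetical choices made for illustration, not the paper's actual implementation.

```python
from collections import defaultdict


def group_by_hash(codes):
    """Cluster images by grouping ids that share an identical hash code."""
    groups = defaultdict(list)
    for image_id, code in codes.items():
        groups[code].append(image_id)
    return dict(groups)


def hamming(a, b):
    """Count the bits that differ between two integer hash codes."""
    return bin(a ^ b).count("1")


def similar_images(query_code, codes, max_distance=2):
    """Return ids whose code is within max_distance bits of the query, closest first."""
    hits = [(hamming(query_code, code), image_id) for image_id, code in codes.items()]
    return [image_id for dist, image_id in sorted(hits) if dist <= max_distance]


# Hypothetical 8-bit hash codes for four images (assumed, not from the paper).
codes = {
    "a.jpg": 0b10110010,
    "b.jpg": 0b10110010,
    "c.jpg": 0b10110011,
    "d.jpg": 0b01001101,
}

clusters = group_by_hash(codes)
neighbors = similar_images(0b10110010, codes, max_distance=1)
```

The linear scan in `similar_images` is only for readability; a system at the scale described here would presumably look codes up through an index, much like keywords in a text search engine.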
1 Introduction
Images are one of the most popular media types in daily life. With the profusion of
digital cameras and camera cell phones, the number of images, including personal photo
collections and web image repositories, has increased quickly in recent years. As a result,
people increasingly need to find desired images on the web. To meet those needs, many image search
engines have been developed and are commercially available. For instance, both
Google Image Search [1] and Yahoo [2] have indexed over one billion images.
Present image search engines generally accept only keyword-based queries, and only a
few simple content-based methods have been supported recently. Google and Yahoo
allow the categorization of images according to their sizes (large, medium and small) or
colors (black/white vs. color images). Fotolia [3] provides limited support for searching
images based on their colors, which is coarse and far from sufficient.
Because images are a visual medium, content-based image retrieval
(CBIR) has been well studied and many CBIR algorithms have been proposed.
ImageRover, RIME and WebSeer are among the early content-based image retrieval