Y. Zhuang et al. (Eds.): PCM 2006, LNCS 4261, pp. 835–843, 2006. © Springer-Verlag Berlin Heidelberg 2006

Compact Representation for Large-Scale Clustering and Similarity Search

Bin Wang 1, Yuanhao Chen 1, Zhiwei Li 2, and Mingjing Li 2

1 University of Science and Technology of China
2 Microsoft Research Asia
{binwang, yhchen04}@ustc.edu, {zli, mjli}@microsoft.com

Abstract. Although content-based image retrieval has been researched for many years, few content-based methods are implemented in present image search engines. This is partly because of the great difficulty of indexing and searching high-dimensional feature spaces for large-scale image datasets. In this paper, we propose a novel method that represents the content of each image as one or multiple hash codes, which can be considered special keywords. Based on this compact representation, images can be accessed very quickly by their visual content. Furthermore, two advanced functionalities are implemented. One is content-based image clustering, which is simplified to grouping images with identical or nearly identical hash codes. The other is content-based similarity search, which is approximated by finding images with similar hash codes. The hash code extraction process is very simple, and both image clustering and similarity search can be performed in real time. Experiments on over 11 million images collected from the web demonstrate the efficiency and effectiveness of the proposed method.

Keywords: similarity search, image clustering, hash code.

1 Introduction

Images are one of the most popular media types in our daily life. With the profusion of digital cameras and camera cell phones, the number of images, including personal photo collections and web image repositories, has increased quickly in recent years. As a result, people increasingly need to find desired images on the web. To meet those needs, many image search engines have been developed and are commercially available.
For instance, both Google Image Search [1] and Yahoo [2] have indexed over one billion images. Present image search engines generally accept only keyword-based queries, and only a few simple content-based functions have been supported recently. Google and Yahoo allow images to be categorized by size (large, medium and small) or by color (black/white vs. color images). Fotolia [3] provides limited support for searching images by color, which is coarse and far from sufficient for describing image content. Because images are a visual medium, content-based image retrieval (CBIR) has been well studied and many CBIR algorithms have been proposed. ImageRover, RIME and WebSeer are among the early content-based image retrieval