IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 2, JUNE 2002 235

The GC-Tree: A High-Dimensional Index Structure for Similarity Search in Image Databases

Guang-Ho Cha and Chin-Wan Chung

Abstract—With the proliferation of multimedia data, there is an increasing need to support the indexing and retrieval of high-dimensional image data. In this paper, we propose a new dynamic index structure called the GC-tree (or the grid cell tree) for efficient similarity search in image databases. The GC-tree is based on a special subspace partitioning strategy which is optimized for a clustered high-dimensional image dataset. The basic ideas are threefold: 1) we adaptively partition the data space based on a density function that identifies dense and sparse regions in a data space; 2) we concentrate the partition on the dense regions, and the objects in the sparse regions of a certain partition level are treated as if they lie within a single region; and 3) we dynamically construct an index structure that corresponds to the space partition hierarchy. The resultant index structure adapts well to the strongly clustered distribution of high-dimensional image datasets. To demonstrate the practical effectiveness of the GC-tree, we experimentally compared the GC-tree with the IQ-tree, the LPC-file, the VA-file, and the linear scan. The results of our experiments show that the GC-tree outperforms all other methods.

Index Terms—Dynamic index structure, GC-tree, high-dimensional indexing, image database, nearest neighbor search (NN search), similarity search.

I. INTRODUCTION

SIMILARITY search in high-dimensional image databases is an interesting and important, but difficult, problem. The most typical type of similarity search is the $k$-nearest neighbor ($k$-NN) search. The traditional $k$-NN problem is defined as follows. Consider a database DB consisting of $N$ points from $S = D_1 \times \cdots \times D_d$, where $D_i \subseteq \mathbb{R}$. Each $D_i$ usually consists of either integers or floats.
A $k$-NN query consists of a point $q$ and a positive integer $k$. The $k$-NN search finds the $k$ nearest neighbors of $q$ with respect to a distance function $L$. The output set $O$ consists of $k$ points from the database DB such that, for every $p \in O$ and every $r \in \mathrm{DB} \setminus O$,

$$L(p, q) \le L(r, q).$$

The actual problem in image database applications is how to process such queries so that the nearest objects can be returned within the desired response time. Therefore, our focus is the development of an indexing method to accelerate the speed of the $k$-NN search.

Manuscript received April 21, 2001; revised February 26, 2002. This work was supported by the University Fundamental Research Program of Ministry of Information and Communication in the Republic of Korea under Grant 2001-116-3. The associate editor coordinating the review of this paper and approving it for publication was Dr. Sankar Basu.
G.-H. Cha is with the Department of Multimedia Science, Sookmyung Women’s University, Seoul 140-742, South Korea (e-mail: ghcha@sookmyung.ac.kr).
C.-W. Chung is with the Department of Computer Science, Korea Advanced Institute of Science and Technology, Taejon 305-701, South Korea (e-mail: chungcw@islab.kaist.ac.kr).
Publisher Item Identifier S 1520-9210(02)04857-5.

For applications where the vectors have low or medium dimensionalities (e.g., less than 10), the state-of-the-art tree-based indexing techniques such as the R*-tree [2], the X-tree [5], the HG-tree [7], and the SR-tree [14] can be usefully employed to solve the $k$-NN problem. So far, however, there is no effective solution to this problem for applications in which the vectors have high dimensionalities, say over 100. Therefore, the main issue is to overcome the curse of dimensionality [20], a phenomenon in which the performance of indexing methods degrades drastically as the dimensionality increases.

A. Motivation

Recently, we developed a new vector approximation-based indexing method called the local polar coordinate (LPC)-file [6] for the $k$-NN search.
The LPC-file significantly improved the search performance for large collections of high-dimensional vectors compared with the linear scan and the VA-file [23]. The linear scan is often used as the yardstick for comparison with other indexing methods, since most tree-structured indexing methods cannot outperform it in high-dimensional data spaces. The VA-file was the first vector approximation-based indexing method to overcome the dimensionality curse. Although the LPC-file provided significant improvements over previous techniques, its performance degrades if the dataset is highly clustered, because it employs a simple space partitioning strategy and a uniform bit allocation strategy for representing the partitioned regions.

In the current vector approximation approach, including the VA-file and the LPC-file, there is an implicit assumption that it is very unlikely that several points lie in the same cell. In fact, unlike the partitioning- or clustering-based approach (i.e., the traditional indexing methods), the vector approximation approach benefits from the sparseness of the high-dimensional data space, provided that several data points do not fall into the same cell. However, if the data points are highly clustered, as in image datasets, the probability that a certain cell includes several points increases, and those vectors may therefore share the same approximation. This means that the discriminatory power of the approximation decreases, so fewer candidates are eliminated in each phase of the $k$-NN search, and ultimately more disk accesses are required during the search.

1520-9210/02$17.00 © 2002 IEEE

Figs. 1 and 2 compare the vector selectivity for random and real image datasets during the first (filtering) phase and the second (refinement) phase of the $k$-NN search in the vector approximation approach, respectively. The vector selectivity is de-