IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 4, NO. 2, JUNE 2002 235
The GC-Tree: A High-Dimensional Index Structure
for Similarity Search in Image Databases
Guang-Ho Cha and Chin-Wan Chung
Abstract—With the proliferation of multimedia data, there is an
increasing need to support the indexing and retrieval of high-dimensional
image data. In this paper, we propose a new dynamic
index structure called the GC-tree (or the grid cell tree) for efficient
similarity search in image databases. The GC-tree is based on a
special subspace partitioning strategy which is optimized for a clustered
high-dimensional image dataset. The basic ideas are threefold:
1) we adaptively partition the data space based on a density
function that identifies dense and sparse regions in a data space; 2)
we concentrate the partition on the dense regions, and the objects
in the sparse regions of a certain partition level are treated as if they
lie within a single region; and 3) we dynamically construct an index
structure that corresponds to the space partition hierarchy. The
resultant index structure adapts well to the strongly clustered distribution
of high-dimensional image datasets. To demonstrate the
practical effectiveness of the GC-tree, we experimentally compared
the GC-tree with the IQ-tree, the LPC-file, the VA-file, and the
linear scan. The result of our experiments shows that the GC-tree
outperforms all other methods.
Index Terms—Dynamic index structure, GC-tree, high-dimen-
sional indexing, image database, nearest neighbor search (NN
search), similarity search.
I. INTRODUCTION
Similarity search in high-dimensional image databases
is an interesting and important, but difficult problem. The
most typical type of similarity search is the $k$-nearest neighbor
($k$-NN) search. The traditional $k$-NN problem is defined as follows.
Consider a database DB consisting of $N$ points from
$S = D_1 \times \cdots \times D_d$, where $D_i \subseteq \mathbb{R}$.
Each $D_i$ usually consists of either integers or floats. A $k$-NN query
consists of a point $q \in S$ and a positive integer $k$. The $k$-NN search
finds the $k$ nearest neighbors of $q$ with respect to a distance function $L$.
The output set $O$ consists of $k$ points from the database such that
$$\forall p \in O \ \text{and} \ \forall r \in \mathrm{DB} \setminus O: \quad L(p, q) \le L(r, q).$$
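The definition above can be made concrete with a brute-force linear scan, which is also the baseline the paper compares against. The following is a minimal sketch, not the authors' implementation: the function names `euclidean` and `knn_linear_scan` are hypothetical, and the Euclidean metric is only an assumed choice for the distance function.

```python
import heapq
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    """Assumed distance function; any metric could be substituted."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_linear_scan(
    db: List[Point],
    q: Point,
    k: int,
    dist: Callable[[Point, Point], float] = euclidean,
) -> List[Tuple[float, Point]]:
    """Return the k points of db closest to q as (distance, point) pairs.

    Every point is examined once, so the cost grows linearly with the
    database size. The result is sorted by increasing distance.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    scored = ((dist(p, q), p) for p in db)
    # nsmallest keeps only k candidates in memory while scanning.
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    database = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.2, 0.1)]
    for d, p in knn_linear_scan(database, (0.0, 0.0), k=2):
        print(p, round(d, 4))
```

Every point in the returned set is at least as close to the query as any point left out, which is exactly the condition in the definition.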
The actual problem in image database applications is how to
process such queries so that the nearest objects can be returned
within the desired response time. Therefore, our focus is the
development of an indexing method to accelerate the
$k$-NN search.
Manuscript received April 21, 2001; revised February 26, 2002. This work
was supported by the University Fundamental Research Program of Ministry
of Information and Communication in the Republic of Korea under Grant
2001-116-3. The associate editor coordinating the review of this paper and
approving it for publication was Dr. Sankar Basu.
G.-H. Cha is with the Department of Multimedia Science, Sookmyung
Women’s University, Seoul 140-742, South Korea (e-mail: ghcha@sookmyung.ac.kr).
C.-W. Chung is with the Department of Computer Science, Korea Advanced
Institute of Science and Technology, Taejon 305-701, South Korea (e-mail:
chungcw@islab.kaist.ac.kr).
Publisher Item Identifier S 1520-9210(02)04857-5.
For applications where the vectors have low or medium dimensionality
(e.g., fewer than 10 dimensions), state-of-the-art tree-based
indexing techniques such as the R*-tree [2], the X-tree [5], the
HG-tree [7], and the SR-tree [14] can be usefully employed to
solve the $k$-NN problem. So far, however, there is no effective
solution to this problem for applications in which the vectors
have high dimensionality, say over 100. Therefore, the main
issue is to overcome the curse of dimensionality [20], a phenomenon
in which the performance of indexing methods degrades
drastically as the dimensionality increases.
A. Motivation
Recently, we developed a new vector approximation-based
indexing method called the local polar coordinate (LPC)-file [6]
for the $k$-NN search. The LPC-file significantly improved the
search performance for large collections of high-dimensional
vectors compared with the linear scan and the VA-file [23]. The
linear scan is often used as the yardstick for comparison with
other indexing methods, since most tree-structured indexing
methods cannot outperform it in high-dimensional data spaces.
The VA-file was the first vector approximation-based indexing
method to overcome the dimensionality curse. Although the
LPC-file provided significant improvements over
previous techniques, it suffers performance degradation
if the dataset is highly clustered, because it employs a simple
space partitioning strategy and a uniform bit allocation
strategy for representing the partitioned regions.
The current vector approximation approach, including the
VA-file and the LPC-file, implicitly assumes that it
is very unlikely that several points lie in the same cell. Indeed,
when data points are assumed not to fall into the same cell,
the vector approximation approach benefits from the sparseness
of the high-dimensional data space, unlike the partitioning
or clustering-based approach (i.e., the traditional indexing
methods). However, if the data points are highly
clustered, as in image datasets, the probability that a certain cell
includes several points increases, and therefore those vectors
may share the same approximation. This means that the discriminatory
power of the approximation decreases, so fewer candidates
are eliminated in each phase of the $k$-NN
search, and ultimately more disk accesses are required during
the search.
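The effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a reproduction of the VA-file or the LPC-file: it quantizes each dimension of the unit cube into equal slices with a uniform number of bits, and all names and parameters (`cell_of`, `shared_fraction`, 16 dimensions, 2 bits per dimension, Gaussian clusters with standard deviation 0.02) are hypothetical choices for illustration.

```python
import random
from collections import Counter


def cell_of(point, bits=2):
    """Map a point in [0, 1]^d to its grid cell using `bits` bits per dimension."""
    slices = 1 << bits
    return tuple(min(int(x * slices), slices - 1) for x in point)


def shared_fraction(points, bits=2):
    """Fraction of points whose cell (approximation) is shared with another point."""
    counts = Counter(cell_of(p, bits) for p in points)
    return sum(c for c in counts.values() if c > 1) / len(points)


def uniform_points(n, d, rng):
    """Points drawn uniformly at random from the unit cube."""
    return [[rng.random() for _ in range(d)] for _ in range(n)]


def clustered_points(n, d, rng, clusters=5, sigma=0.02):
    """Points drawn from tight Gaussian clusters, clipped to the unit cube."""
    centers = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    points = []
    for i in range(n):
        center = centers[i % clusters]
        points.append([min(max(rng.gauss(m, sigma), 0.0), 1.0) for m in center])
    return points


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = shared_fraction(uniform_points(2000, 16, rng))
    clustered = shared_fraction(clustered_points(2000, 16, rng))
    print("uniform  :", uniform)
    print("clustered:", clustered)
```

With these settings the grid has far more cells than there are points, so uniformly distributed points rarely collide, while clustered points frequently fall into the same cell and therefore receive identical approximations.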
Figs. 1 and 2 show the vector selectivity comparison between
random and real image datasets during the first filtering and the
second refinement phases of the $k$-NN search in the vector approximation
approach, respectively. The vector selectivity is de-
1520-9210/02$17.00 © 2002 IEEE