Effective Nearest Neighbor Indexing with the Euclidean Metric

Sang-Wook Kim
Division of Computer, Information, and Communications Engineering
Kangwon National University
wook@kangwon.ac.kr

Charu C. Aggarwal and Philip S. Yu

ABSTRACT
Nearest neighbor search is an important operation widely used in multimedia databases. In high dimensions, most previous methods for nearest neighbor search become inefficient and must compute nearest neighbor distances to a large fraction of the points in the space. In this paper, we present a new approach for processing nearest neighbor searches with the Euclidean metric, which searches over only a small subset of the original space. This approach effectively approximates clusters by encapsulating them in geometrically regular shapes and also computes tighter upper and lower bounds on the distances from the query point to the clusters. To show the effectiveness of the proposed approach, we perform extensive experiments. The results reveal that the proposed approach significantly outperforms the X-tree as well as the sequential scan.

Keywords
Similarity search, nearest neighbor queries, multimedia databases, high dimensional indexes, Euclidean metric

1. Introduction
Similarity search is an important problem in the field of multimedia databases [5]. Often, it is desirable to provide the functionality of searching for similar images in the database. The features of images can be represented as points, called feature vectors, in high dimensional space [1][2][12]. These points represent information about color histograms, textures, or other descriptors of the images. The points are often stored in some form of index that facilitates various types of queries on the database. One of the queries that helps in providing the functionality for similarity search is the nearest neighbor query [3][4][14].
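As a minimal illustration of the feature-vector representation described above (not taken from the paper; the 8-bin grayscale histogram and the function name `histogram_feature` are assumptions for the sketch), an image can be mapped to a point in a low-dimensional space via a normalized intensity histogram:

```python
import numpy as np

def histogram_feature(image: np.ndarray, bins: int = 8) -> np.ndarray:
    """Map a grayscale image (pixel values in [0, 256)) to a normalized
    histogram, i.e. a feature vector / point in `bins`-dimensional space."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

# Toy 2x3 "image"; real descriptors (color histograms, textures) are richer.
img = np.array([[0, 10, 200], [255, 128, 30]])
vec = histogram_feature(img)
assert vec.shape == (8,)               # a point in 8-dimensional space
assert abs(vec.sum() - 1.0) < 1e-9     # normalized histogram
```

Similarity between two images then reduces to proximity between their feature vectors under a distance function such as the Euclidean metric.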
The nearest neighbor query is formulated as follows: for a given target query point t, find the point in the database that has the shortest distance from t. The k-nearest neighbor query is the generalization of the nearest neighbor query, and requires us to find the k points closest to the given target point t. Various distance functions may be used in order to determine the notion of proximity.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'01, November 5-10, 2001, Atlanta, Georgia, USA.
Copyright 2001 ACM 1-58113-436-3/01/0011...$5.00.

IBM T. J. Watson Research Center
{charu, psyu}@us.ibm.com

Recent results show that for many distance functions in high dimensionality, the concept of proximity may not be very well defined [11]. Furthermore, for many applications, the distance function is heuristic to begin with, so the nearest neighbor problem may be viewed from many novel perspectives. For example, locality-specific projections [11] could be used in order to find the nearest neighbors in projections that are based on dimensional selectivity. Another alternative is to redefine the distance function in order to make it more meaningful and effective. For such applications, we show that it is possible to improve the nearest neighbor search both qualitatively and from a performance perspective. For some problems, however, the distance function is pre-defined, and there is no way of avoiding the sparsity effects of high dimensionality in such cases. The most commonly used distance function is the Euclidean metric.
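The k-nearest neighbor query under the Euclidean metric can be sketched as a linear scan over the database, which is exactly the sequential-scan baseline that index structures aim to beat (the function name `knn` and the toy data are illustrative, not from the paper):

```python
import numpy as np

def knn(points: np.ndarray, t: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k database points closest to target t."""
    dists = np.linalg.norm(points - t, axis=1)  # Euclidean distance to t
    return np.argsort(dists)[:k]                # indices of the k nearest

# Toy 2-dimensional database of four points.
db = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]])
print(knn(db, np.array([0.0, 0.1]), 2))  # prints [0 3]
```

With k = 1 this reduces to the plain nearest neighbor query; the whole difficulty addressed in this paper is answering such queries without scanning all n points.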
The results of this paper are tailored to applications that employ this particular metric as a pre-defined measure. In such cases, it becomes important to provide ways of performing the search more efficiently. For efficient processing of nearest neighbor queries, there have also been many research efforts on high dimensional indexing, such as R*-trees [8], X-trees [6], M-trees [10], SR-trees [13], TV-trees [15], and SS-trees [19]. Weber et al. [18] proved that the sequential scan always outperforms tree-based multidimensional indexes for uniformly distributed data whenever the dimensionality is above 10. To overcome this problem, they proposed an approximation-based scheme with the VA-file, a set of bit-compressed versions of the points. Recently, Berchtold et al. [7] proposed a hybrid approach combining the VA-file and a tree-based index.

The performance of previous multidimensional indexes, which use multidimensional rectangles and/or spheres to represent the capsule of a point cluster, deteriorates seriously as the number of dimensions grows. In this paper, we first point out that this simple representation of capsules incurs performance degradation in processing nearest neighbor queries. To alleviate this problem, we propose (1) adopting new coordinate systems appropriate to a given cluster, (2) representing various shapes of capsules by using hyperspheres, and (3) maintaining outliers separately. Our approach effectively approximates clusters by encapsulating them into geometrically regular shapes and also quickly computes tighter upper and lower bounds on the distances from the query point to the clusters. We also propose an efficient algorithm that touches a small