Nearest Neighbours Search Using the PM-Tree

Tomáš Skopal¹, Jaroslav Pokorný¹, and Václav Snášel²

¹ Charles University in Prague, FMP, Department of Software Engineering, Malostranské nám. 25, 118 00 Prague, Czech Republic, EU
tomas@skopal.net, jaroslav.pokorny@mff.cuni.cz
² VŠB–Technical University of Ostrava, FECS, Dept. of Computer Science, tř. 17. listopadu 15, 708 33 Ostrava, Czech Republic, EU
vaclav.snasel@vsb.cz

Abstract. We introduce a method of searching for the k nearest neighbours (k-NN) using the PM-tree. The PM-tree is a metric access method for similarity search in large multimedia databases. As an extension of the M-tree, the PM-tree exploits local dynamic pivots (as the M-tree does) as well as global static pivots (used by LAESA-like methods). While in the M-tree a metric region is represented by a hyper-sphere, in the PM-tree the "volume" of the metric region is further reduced by a set of hyper-rings. As a consequence, the shape of the PM-tree's metric region bounds the indexed objects more tightly, which in turn improves the overall search efficiency. Besides the description of the PM-tree, we propose an optimal k-NN search algorithm. Finally, the efficiency of k-NN search is experimentally evaluated on large synthetic as well as real-world datasets.

1 Introduction

The volume of multimedia databases is increasing rapidly, and the need for efficient content-based search in large multimedia databases is growing with it. In particular, there is a need to search for the k documents most similar to a given query document (called the k nearest neighbours, k-NN). Since multimedia documents are modelled by objects (usually vectors) in a feature space U, the multimedia database can be represented by a dataset S ⊂ U, where n = |S| is the size of the dataset. The search in S is accomplished by an access method, which retrieves the objects relevant to a given similarity query. The similarity measure is often modelled by a metric, i.e.
a distance d satisfying the properties of reflexivity, positivity, symmetry, and the triangle inequality. Given a metric space M = (U, d), the metric access methods (MAMs) [4] organize the objects in S such that a structure in S is recognized (i.e. a kind of metric index is constructed) and exploited for efficient (i.e. quick) search in S. To keep the search as efficient as possible, MAMs should minimize the computation costs (CC) and the I/O costs. The computation costs are the number of (computationally expensive) distance computations performed during query evaluation. The I/O costs are related to the volume of data that must be transferred from secondary memory (also referred to as the disk access costs).

L. Zhou, B.C. Ooi, and X. Meng (Eds.): DASFAA 2005, LNCS 3453, pp. 803–815, 2005.
© Springer-Verlag Berlin Heidelberg 2005
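
The metric properties and the notion of computation costs introduced above can be illustrated with a small sketch. It is not part of the PM-tree itself: it only checks the metric axioms on a finite sample and runs a sequential-scan k-NN baseline that counts distance computations, which is the cost a MAM tries to reduce. The names `CountingMetric`, `is_metric_on`, and `knn_sequential`, the Euclidean distance, and the sample points are all assumptions made for illustration.

```python
import math
from itertools import product


class CountingMetric:
    """Wraps a distance function and counts its evaluations (the computation costs)."""

    def __init__(self, dist):
        self.dist = dist
        self.calls = 0

    def __call__(self, x, y):
        self.calls += 1
        return self.dist(x, y)


def euclidean(x, y):
    """Euclidean (L2) distance, used here as an example metric d."""
    return math.dist(x, y)


def is_metric_on(sample, d, eps=1e-9):
    """Check the metric axioms on a finite sample (a sanity check, not a proof)."""
    for x, y in product(sample, repeat=2):
        # Reflexivity and positivity: d(x, y) = 0 exactly when x = y.
        if (x == y) != (d(x, y) <= eps):
            return False
        # Symmetry: d(x, y) = d(y, x).
        if abs(d(x, y) - d(y, x)) > eps:
            return False
    for x, y, z in product(sample, repeat=3):
        # Triangle inequality: d(x, z) <= d(x, y) + d(y, z).
        if d(x, z) > d(x, y) + d(y, z) + eps:
            return False
    return True


def knn_sequential(S, q, k, d):
    """Baseline k-NN by scanning all of S: exactly |S| distance computations."""
    scored = [(d(q, o), o) for o in S]
    return [o for _, o in sorted(scored)[:k]]


# Hypothetical toy dataset S and query object q.
S = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (1.0, 1.0)]
q = (0.9, 0.2)

d = CountingMetric(euclidean)
result = knn_sequential(S, q, 2, d)
print(result, "distance computations:", d.calls)
```

The sequential scan always spends n = |S| distance computations per query. A metric index can only skip some of them because the triangle inequality lets it discard objects without computing their distance to the query, so a distance that fails the check above (for example, the squared Euclidean distance on this sample) would not be safe to index this way.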