Distributed and Parallel Databases, 9, 67–92, 2001
© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Distributed Processing of Similarity Queries

APOSTOLOS N. PAPADOPOULOS apostol@delab.csd.auth.gr
YANNIS MANOLOPOULOS yannis@delab.csd.auth.gr
Data Engineering Research Lab., Department of Informatics, Aristotle University, 54006 Thessaloniki, Greece

Received July 23, 1997; Revised November 6, 1998; Accepted May 21, 1999

Abstract. Many modern applications in diverse fields demand the efficient manipulation of very large multidimensional datasets. Efficient and effective query processing techniques are evidently needed to provide acceptable response times. In this paper, we study the processing of similarity nearest neighbor queries in large distributed multidimensional databases, where objects are represented as vectors in a vector space and are distributed in a multi-computer environment. The departure from the centralized case brings a number of advantages and (unfortunately) a number of difficulties that need to be overcome. From this perspective, four query evaluation strategies are presented, namely Concurrent Processing (CP), Selective Processing (SP), Two-Phase Processing (2PP) and Probabilistic Processing (PRP). The proposed techniques are compared analytically and experimentally, in order to identify the advantages of each one, as well as the cases in which each one is best applied. Experimental results are presented, demonstrating the performance of each method under different parameter values. We also investigate the impact of the derived data that should be maintained in order to process similarity queries efficiently.

Keywords: distributed databases, multidimensional data, similarity queries, query processing

1. Introduction

Multidimensional data appear in many modern applications in diverse fields.
Geographical Information Systems (GIS) require the storage and manipulation of objects in 2-d and 3-d spaces [14, 19]; Image/Video Retrieval by Similarity and Pattern Recognition require the extraction of features from the images and the mapping of these features into a high dimensional space, in order to speed up query processing [11, 17]; in Time Series databases, the objects are first transformed (e.g. by means of the Fourier transform) and then some components (those with the highest energy) are used to index the underlying set [2, 10]; documents in a Text Database can be represented as vectors (e.g. using the Latent Semantic Indexing technique) in a high dimensional space [8]; records in traditional alphanumeric databases can be viewed as points in a high dimensional space, assuming one dimension for each record attribute [16].

These applications require databases that are huge in volume. Often, multiple computer systems are used in order to support efficient and effective retrieval. In some cases, due to the nature of the application, there is no alternative to distribution (e.g. in medical information systems, where the data of one hospital are kept separate from the data of another hospital). The departure from the centralized case offers a number of significant advantages such as: parallelism exploitation during query processing, distributed