Pergamon
Information Systems Vol. 21, No. 6, pp. 475-496, 1996
Copyright © 1996 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
PII: S0306-4379(96)00024-5
0306-4379/96 $15.00 + 0.00

PARTITIONING SIMILARITY GRAPHS: A FRAMEWORK FOR DECLUSTERING PROBLEMS†

DUEN-REN LIU¹ and SHASHI SHEKHAR²

¹Institute of Information Management, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
²Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA

(Received 87 July 1994; in final revised form 12 August 1996)

Abstract - Declustering problems are well known in databases for parallel computing environments. In this paper, we propose a new similarity-based technique for declustering data. The proposed method can adapt to the available information about query distribution (e.g. size, shape and frequency) and can work with alternative atomic data-types. Furthermore, the proposed method is flexible and can work with alternative data distributions, data sizes and partition-size constraints. The method is based on max-cut partitioning of a similarity graph defined over the given set of data, under constraints on the partition sizes. It maximizes the chances that a pair of atomic data-items that are frequently accessed together by queries are allocated to distinct disks. We describe the application of the proposed method to parallelizing Grid Files at the data page level. Detailed experiments in this context show that the proposed method adapts to query distribution and data distribution, and that it outperforms traditional mapping-function-based methods for many interesting query distributions as well as for several non-uniform data distributions. Copyright © 1996 Elsevier Science Ltd

Key words: Similarity Graph, Geographic Databases, Declustering, Grid File, Parallel Databases

1. INTRODUCTION

With an increasing performance gap between processors and I/O systems, parallelizing I/O operations by declustering [12, 11, 30] data is becoming essential for high performance applications. Database machines, multi-processors and parallel computers can all benefit from effective declustering. The declustering problem can be stated as follows: Given a set of atomic data-items, N disks, and a set of queries, divide the set of data items among the N disks, respecting the disk capacity constraints, to minimize response time for the given set of queries. Unfortunately, this problem is NP-complete in several contexts, which include partial match queries on Cartesian product files [11] and join queries on a set of relations [30]. Thus any method that solves this problem in polynomial time will be heuristic.

We address the declustering problem in a single-processor, multi-disk environment. We abstract the properties of multi-disk secondary storage systems in terms of their capability of carrying out N independent disk operations in parallel. The storage system is viewed as a collection of logical disks, each with an independent read/write head and an independent channel to transfer data to/from the processor's memory. Disk block accesses over different logical disks are independent and can be carried out in parallel. Thus the storage system can reduce the response time for large I/O volumes by up to a factor of N, where N is the number of disks in the system. We focus on I/O cost only. Readers are referred to MAGIC [17] for a more general cost model that includes communication costs. Furthermore, the data items are assumed to be atomic, i.e., a data item will not be split across disks. Data items like records, objects, pages and page-clusters are likely to satisfy this assumption. This assumption excludes strategies such as splitting a data item (e.g. files) across disks.
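To make the problem statement concrete, the following minimal Python sketch evaluates an allocation of atomic data items to disks. It is an illustration under simplifying assumptions that are not part of the formulation above: every item costs one unit of I/O, and the response time of a query is taken to be the largest number of requested items residing on any single disk. The names `response_time`, `total_response_time` and `respects_capacity`, as well as the toy data, are hypothetical.

```python
from collections import Counter


def response_time(query, allocation):
    """I/O response time of one query under the unit-cost assumption.

    Disks work in parallel, so the query finishes when the busiest disk
    has fetched all of the requested items stored on it.
    """
    per_disk = Counter(allocation[item] for item in query)
    return max(per_disk.values(), default=0)


def total_response_time(queries, allocation):
    """Sum of response times over the given set of queries."""
    return sum(response_time(q, allocation) for q in queries)


def respects_capacity(allocation, n_disks, capacity):
    """Check that no disk holds more than `capacity` items."""
    load = Counter(allocation.values())
    return all(load[d] <= capacity for d in range(n_disks))


# Toy instance (assumed data): six pages, N = 2 disks, capacity 3 each.
queries = [{0, 1}, {2, 3}, {4, 5}, {0, 1, 2, 3}]

# Allocation A keeps neighbouring pages together on the same disk.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Allocation B spreads neighbouring pages over different disks.
round_robin = {page: page % 2 for page in range(6)}

print(total_response_time(queries, clustered))    # 8
print(total_response_time(queries, round_robin))  # 5
```

In this toy instance the second allocation reaches 5, which equals the sum of ceil(|q|/N) over the four queries, so no allocation can do better under the stated assumptions. The first allocation also satisfies the capacity constraint but serializes items that are requested together.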
†Recommended by Patrick O'Neil

Several heuristic methods have been proposed that are based on the ideas of mapping functions, similarity and load-balancing. The mapping-function-based techniques have been proposed for k-dimensional and spatial data with partial match queries and range queries. These methods provide