Query Planning for Range Queries with User-defined Aggregation on Multi-dimensional Scientific Datasets

Chialin Chang, Tahsin Kurc, Alan Sussman, Joel Saltz

UMIACS and Dept. of Computer Science
University of Maryland
College Park, MD 20742

Dept. of Pathology
Johns Hopkins Medical Institutions
Baltimore, MD 21287

{chialin,kurc,als,saltz}@cs.umd.edu

Abstract

Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, the datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space. The processing is usually highly stylized, with the basic processing steps consisting of (1) retrieval of a subset of all available data in the input dataset via a range query, (2) projection of each input data item to one or more output data items, and (3) some form of aggregation of all the input data items that project to each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval and processing of multi-dimensional datasets on shared-nothing architectures. In this paper we address query planning and execution strategies for range queries with user-defined processing. We evaluate three potential query planning strategies within the ADR framework under several application scenarios, and present experimental results on the performance of the strategies on a multiprocessor IBM SP2.

1 Introduction

Large amounts of data are being generated in many scientific and engineering studies by detailed simulations, and by sensors attached to devices such as satellites and microscopes. Hence, storing, retrieving, processing and analyzing very large amounts of scientific data have become an important part of scientific research.
This research was supported by the National Science Foundation under Grants #ACI-9619020 (UC Subcontract #10152408) and #BIR 9318183, and the Office of Naval Research under Grant #N6600197C8534. The Maryland IBM SP2 used for the experiments was provided by NSF CISE Institutional Infrastructure Award #CDA9401151 and a grant from IBM.

Typical