Time and Space Optimization for Processing Groups of Multi-Dimensional Scientific Queries

Suresh Aryangat, Henrique Andrade, Alan Sussman
Department of Computer Science
University of Maryland
College Park, MD 20742
{suresha,hcma,als}@cs.umd.edu

Abstract

Data analysis applications in areas as diverse as remote sensing and telepathology require operating on and processing very large datasets. For such applications to execute efficiently, careful attention must be paid to the storage, retrieval, and manipulation of the datasets. This paper addresses the optimizations performed by a high performance database system that processes groups of data analysis requests for these applications, which we call queries. The system performs end-to-end processing of the requests, formulated as PostgreSQL declarative queries. The queries are converted into imperative descriptions, multiple imperative descriptions are merged into a single execution plan, the plan is optimized to decrease execution time via common compiler optimization techniques, and, finally, the plan is optimized to decrease memory consumption. The last two steps effectively reduce both the time and space to execute query groups, as shown in the experimental results.

1 Introduction

Many applications are emerging that process very large multi-dimensional datasets. One example of such an application is Kronos, which is used by earth scientists to process satellite images of the Earth. Another example is the Virtual Microscope, which provides realistic digital emulation of a high power light microscope. Many similar data analysis applications display a common processing structure [2].
We have developed a scientific database system that exploits this common processing structure and performs various optimizations geared towards both reducing the time to process a single data analysis request (a query) and improving database throughput. Queries from these applications are not only expensive to compute, consuming large amounts of I/O and computational resources, but also have very high memory requirements and often require executing user-defined operations that are not easily implemented in commercial relational databases. In previous work, we have leveraged data and computation reuse for queries individually submitted to the system over an extended period of time. However, for a set of queries considered as a single group, a global query plan that accommodates all the queries can often be much more efficient than executing each query separately, assuming that we can devise methods for efficiently removing redundancies across queries while minimizing the use of memory resources.

This research was supported by the National Science Foundation under Grants #EIA-0121161, #ACI-9619020 (UC Subcontract #10152408), and #ACI-9982087, Lawrence Livermore National Laboratory under Grant #B517095, and NASA under Grants #NAG5-11994 and #NAG5-12652.

The system accepts a group of declarative data analysis queries, specified in PostgreSQL, and converts the group to an imperative form. The resulting program contains one or more loops over multidimensional ranges of some subset of the dataset attributes (e.g., latitude and longitude for remote sensing). Once the plan is in imperative form, the system performs query plan transformations based on algorithms commonly used by compilers to optimize the intermediate or low-level representations of program source code.
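
As a rough illustration of the idea, and not the system's actual implementation, the following Python sketch contrasts the imperative form of two range queries executed separately with a single merged plan that visits each overlapping region only once. Everything in it is an assumption made for illustration: queries are half-open rectangles `(lat_lo, lat_hi, lon_lo, lon_hi)` over an integer grid, the aggregation is a sum (so partial results can be combined in any order), and the names `run_separately`, `run_merged`, and the per-point operation `op` are hypothetical.

```python
from itertools import product


def run_separately(queries, op):
    """Imperative form of each query: one loop nest per query over its own range."""
    results = []
    for lat_lo, lat_hi, lon_lo, lon_hi in queries:
        total = 0
        for lat in range(lat_lo, lat_hi):
            for lon in range(lon_lo, lon_hi):
                total += op(lat, lon)
        results.append(total)
    return results


def run_merged(queries, op):
    """One plan for the whole group (illustrative only).

    The attribute ranges are cut at every query boundary, so each resulting
    piece lies entirely inside or entirely outside each query. Every piece
    is evaluated at most once and its partial sum is shared by all queries
    that cover it.
    """
    lat_cuts = sorted({q[0] for q in queries} | {q[1] for q in queries})
    lon_cuts = sorted({q[2] for q in queries} | {q[3] for q in queries})
    lat_pieces = list(zip(lat_cuts, lat_cuts[1:]))
    lon_pieces = list(zip(lon_cuts, lon_cuts[1:]))

    results = [0] * len(queries)
    for (la0, la1), (lo0, lo1) in product(lat_pieces, lon_pieces):
        # Queries whose rectangle fully contains this piece.
        users = [
            i for i, q in enumerate(queries)
            if q[0] <= la0 and la1 <= q[1] and q[2] <= lo0 and lo1 <= q[3]
        ]
        if not users:
            continue  # no query needs this piece, so skip its loop entirely
        partial = 0
        for lat in range(la0, la1):
            for lon in range(lo0, lo1):
                partial += op(lat, lon)
        for i in users:
            results[i] += partial
    return results
```

For two 4x4 queries such as `(0, 4, 0, 4)` and `(2, 6, 2, 6)`, which overlap in a 2x2 region, the separate form evaluates `op` 32 times while the merged form evaluates it 28 times and returns the same per-query sums. The sketch relies on the aggregation being decomposable into partial results; an operation without that property would need a different merging strategy.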
After performing these execution-time-reducing transformations, each original query is converted into a sequence of loops iterating on subsets of the original attribute range(s). Because they came from a declarative query, the loops have the property that changing their execution order does not change the correctness of the results. The system is therefore allowed to reorder the execution of the loops to also optimize memory usage. Reducing memory utilization is important because it affects the performance of the system (e.g., by avoiding paging), and may also affect the performance of other applications sharing the same processor and physical memory. In this paper, we describe the time and space optimization techniques we have designed