Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations

Tanu Malik, Randal Burns
Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218
{tmalik, randal}@cs.jhu.edu

Nitesh V. Chawla
Dept. of Computer Science and Engg., University of Notre Dame, Notre Dame, IN 46556
nchawla@cse.nd.edu

Alex Szalay
Dept. of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218
szalay@jhu.edu

Abstract

In a proxy cache for federations of scientific databases, it is important to estimate the size of a query result before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. At the other extreme, inaccurate estimates can render the cache totally ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes that is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but by observing the workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance.

1. Introduction

The National Virtual Observatory (NVO) is a globally-distributed, multi-terabyte federation of astronomical databases. It is used by astronomers world-wide to conduct data-intensive multi-spectral and temporal experiments and has led to many new discoveries [33]. At its present size – 16 sites – network bandwidth bounds performance and limits scalability. The federation is expected to grow to 120 sites in 2007. Proxy caching [3, 6] can reduce the network bandwidth requirements of the NVO and, thus, is critical for achieving scale.
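The core idea stated in the abstract, learning result sizes from the observed workload rather than from the data itself, can be sketched at a very coarse level as follows. This is an illustrative Python sketch only, not the paper's implementation: the literal-stripping normalization in `template_of` and the per-template running-mean estimator are hypothetical stand-ins for CAROT's actual classification and regression over templates.

```python
import re
from collections import defaultdict

def template_of(sql: str) -> str:
    """Reduce a query to its template by replacing literal
    constants with a placeholder (illustrative normalization)."""
    t = re.sub(r"'[^']*'", "?", sql)            # string literals
    t = re.sub(r"\b\d+(\.\d+)?\b", "?", t)      # numeric literals
    return " ".join(t.lower().split())

class TemplateYieldEstimator:
    """Per-template running-mean estimator of query result sizes,
    trained only on the observed workload (hypothetical sketch)."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def observe(self, sql: str, result_bytes: float) -> None:
        """Record the actual result size of a completed query."""
        t = template_of(sql)
        self.sums[t] += result_bytes
        self.counts[t] += 1

    def estimate(self, sql: str, default: float = 0.0) -> float:
        """Predict the result size of a new query from its template."""
        t = template_of(sql)
        n = self.counts[t]
        return self.sums[t] / n if n else default
```

Queries that differ only in their constants share a template, so the estimator generalizes across them without ever touching the remote data.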
In previous work, we demonstrated that bypass-yield (BY) caching [21] can reduce the bandwidth requirements of the Sloan Digital Sky Survey (SDSS) [34], a principal site of the NVO, by a factor of five. Proxy caching frameworks for scientific databases, such as bypass-yield caching, replicate database objects, such as columns (attributes), tables, or views, near clients so that queries to the database may be served locally, reducing network bandwidth requirements. BY caches load and evict database objects based on their expected yield: the size of the query results against an object or, equivalently, the network savings realized from caching it. The five-fold reduction in bandwidth is an upper bound, realized when the cache has perfect, a priori knowledge of query result sizes. In practice, a cache must estimate yield.

Similar challenges are faced by distributed applications that rely on accurate estimation of query result sizes. Other examples include load balancing [26], replica maintenance [23, 24], grid computing [30], Web caching [3], and distributed query optimization [2]. In many such applications, estimation is largely ignored; it is treated as an orthogonal issue, and any accurate technique is assumed to suffice.

Existing techniques for yield estimation do not translate to proxy caching for scientific databases. Databases estimate yield by storing a small approximation of the data distribution in an object. There are several obstacles to learning and maintaining such approximations in caches. First, proxy caches are situated close to clients, implying that a cache must learn distributions of remote data. Generating approximations incurs I/O at the databases and network traffic for the cache, reducing the benefit of caching. Moreover, caches and databases are often in different organizational domains, and the requirements for autonomy and privacy in federations are quite stringent [31].
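The yield-driven load and evict decision described above can be illustrated with a standard greedy heuristic: rank candidate objects by expected yield per cached byte and fill the cache in that order. This is a hedged sketch of the general idea, not the BY cache's actual admission policy; the object records and field names are hypothetical.

```python
def choose_cache_contents(objects, capacity):
    """Greedy knapsack by expected yield per cached byte.

    objects  - list of dicts with hypothetical keys:
               "name", "size" (bytes), "yield" (expected bytes saved)
    capacity - cache size in bytes
    Returns the names of objects selected for caching.
    """
    # Rank by network savings per byte of cache consumed.
    ranked = sorted(objects, key=lambda o: o["yield"] / o["size"], reverse=True)
    chosen, used = [], 0
    for o in ranked:
        if used + o["size"] <= capacity:
            chosen.append(o["name"])
            used += o["size"]
    return chosen
```

The heuristic makes the dependence on yield explicit: a badly estimated "yield" field reorders the ranking and can drive the cache toward useless objects, which is why estimation accuracy matters so much.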
Second, a cache is a constrained resource in terms of storage; thus, data structures for estimation must be compact. Finally, scientific queries are complex: they refer to many database attributes and join multiple tables. Traditional database methods typically perform poorly in such cases because they assume independence among attributes when building yield estimates from the component objects' data distributions. For such workloads, it is important to learn approximations of the joint data distribution of objects.

Several statistical techniques exist for learning approximate data distributions. These include sampling, histograms [26], wavelets [7, 35], and kernel density estimators [12].
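As a concrete example of one such approximation, the sketch below builds a classical one-dimensional equi-width histogram over a sampled column and estimates the selectivity of a range predicate; multiplying per-attribute selectivities, as in the last line of the usage note, is precisely the independence assumption criticized above. Class and parameter names are illustrative.

```python
import bisect

class EquiWidthHistogram:
    """Minimal one-dimensional equi-width histogram over a column sample."""

    def __init__(self, values, nbuckets=10):
        lo, hi = min(values), max(values)
        # nbuckets + 1 evenly spaced bucket boundaries over [lo, hi].
        self.edges = [lo + (hi - lo) * i / nbuckets for i in range(nbuckets + 1)]
        self.counts = [0] * nbuckets
        for v in values:
            i = min(bisect.bisect_right(self.edges, v) - 1, nbuckets - 1)
            self.counts[i] += 1
        self.n = len(values)

    def selectivity_lt(self, x):
        """Estimated fraction of rows with value < x, using linear
        interpolation within the bucket that contains x."""
        if x <= self.edges[0]:
            return 0.0
        if x >= self.edges[-1]:
            return 1.0
        i = min(bisect.bisect_right(self.edges, x) - 1, len(self.counts) - 1)
        below = sum(self.counts[:i])
        width = self.edges[i + 1] - self.edges[i]
        frac = (x - self.edges[i]) / width if width else 0.0
        return (below + frac * self.counts[i]) / self.n
```

Given histograms `ha` and `hb` on two attributes, a traditional optimizer would estimate the selectivity of a conjunctive predicate as `ha.selectivity_lt(x) * hb.selectivity_lt(y)`; for correlated scientific attributes this product can be wildly wrong, which motivates learning the joint distribution instead.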