Estimating Query Result Sizes for Bypass-Yield Caches

Tanu Malik, Randal Burns
Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218
tmalik,randal@cs.jhu.edu

Nitesh V. Chawla
Dept. of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556
nchawla@cse.nd.edu

ABSTRACT

In proxy database caches – especially those that minimize total network traffic – it is important to estimate the size of a query result before making a caching decision. With accurate estimates, optimal cache performance can be obtained in principle; at the other extreme, inaccurate estimates can render the cache ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes that is suited to the resource-limited environments of proxy caches. CAROT estimates query result sizes by learning data distributions, not by examining or sampling data, but by observing the workload: queries and their results. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT outperforms conventional estimation techniques and provides near-optimal cache performance.

1. INTRODUCTION

The NVO is a global-scale, multi-terabyte federation of astronomical databases. It is used by astronomers worldwide to conduct data-intensive multi-spectral and temporal experiments, and has led to many new discoveries [16]. At its present size – 16 sites – network bandwidth bounds performance and limits scalability. The federation is expected to grow to 120 sites in 2007.

Proxy caching [4, 2] can reduce the network bandwidth requirements of the NVO and, thus, is critical for achieving scale. In previous work, we demonstrated that bypass-yield caching (BYC) [1] can reduce the bandwidth requirements of the Sloan Digital Sky Survey (SDSS) [17] by a factor of five. SDSS is a principal site of the NVO.
Proxy caching frameworks for databases, such as bypass-yield caching, load and evict objects based on their expected yield: the size of the query results against that object or, equivalently, the network savings realized from caching the object. The five-fold reduction in bandwidth is an upper bound, realized when the cache has perfect, a priori knowledge of query result sizes. In practice, a cache must estimate yield.

Estimating yield is easy if the cache stores an approximation of the data distribution in an object. However, there are several challenges in learning and maintaining such approximations. First, proxy caches are situated close to clients and, therefore, in a different organizational domain than the databases that store the objects. This restricts access to object data, especially in federations, where requirements of autonomy and privacy are quite stringent []. Second, a cache is a resource constrained in storage; learned approximations must be compact, yet accurate. Finally, a cache request may refer to multiple objects. It is important to learn an approximation of the distribution of data not only in a single object but also in a combination of objects.

Similar challenges are faced by distributed applications that rely on accurate estimation of query result sizes. The reliance of BYC on yield estimates provides one such example. Others include load balancing [14], replica maintenance [12, 13], grid computing [], Web caching [2], and distributed query optimization []. In many such applications, estimation is largely ignored; it is treated as an orthogonal issue, because any accurate technique suffices and is thus not part of the architecture.

Several statistical techniques exist for learning an object's data distribution, including sampling, histograms, wavelets, and kernel density estimators. Most of these techniques have so far been considered in the context of query optimization within a database system.
Therefore, complete access to data is always assumed. Further, most of these techniques are not sufficiently compact when learning distributions of a combination of objects. For example, for histograms, the most popular method of learning distributions in databases, the storage overhead and construction cost increase exponentially with the number of combinations [18].

We depart from current solutions in that no object data distribution is learned, since learning one requires complete access to the data. We instead learn the yield distribution of a template, where a template is formed by grouping query statements based on syntactic similarity. Queries within the same template have the same structure against the same set of attributes and relations; they differ in minor ways, using different constants.
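As a minimal sketch of this grouping (not the paper's actual implementation), a template can be obtained by replacing literal constants in a query with a placeholder, so that queries differing only in their constants map to the same template. The helper name and the example queries below are illustrative:

```python
import re

def template_of(query: str) -> str:
    """Map a SQL query to its template: replace literal constants
    with a '?' placeholder and normalize whitespace and case.
    (Hypothetical helper for illustration only.)"""
    # Replace quoted string literals first, then bare numeric literals.
    t = re.sub(r"'[^']*'", "?", query)
    t = re.sub(r"\b\d+(\.\d+)?\b", "?", t)
    return " ".join(t.split()).lower()

# Two queries with the same structure but different constants
# fall into the same template.
q1 = "SELECT objID FROM PhotoObj WHERE ra BETWEEN 180.0 AND 181.5"
q2 = "SELECT objID FROM PhotoObj WHERE ra BETWEEN 200.0 AND 200.3"
assert template_of(q1) == template_of(q2)
```

A cache could then maintain per-template yield statistics keyed on this normalized string, rather than learning a distribution over the underlying object data.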