Bypass Caching: Making Scientific Databases Good Network Citizens

Tanu Malik, Randal Burns
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
{tmalik, randal}@cs.jhu.edu

Amitabh Chaudhary
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
Amitabh.Chaudhary.1@nd.edu

Abstract

Scientific database federations are geographically distributed and network bound. Thus, they could benefit from proxy caching. However, existing caching techniques are not suitable for their workloads, which compare and join large data sets. Existing techniques reduce parallelism by conducting distributed queries in a single cache and lose the data reduction benefits of performing selections at each database. We develop the bypass-yield formulation of caching, which reduces network traffic in wide-area database federations, while preserving parallelism and data reduction. Bypass-yield caching is altruistic; caches minimize the overall network traffic generated by the federation, rather than focusing on local performance. We present an adaptive, workload-driven algorithm for managing a bypass-yield cache. We also develop on-line algorithms that make no assumptions about workload: a k-competitive deterministic algorithm and a randomized algorithm with minimal space complexity. We verify the efficacy of bypass-yield caching by running workload traces collected from the Sloan Digital Sky Survey through a prototype implementation.

1. Introduction

An increasing number of science organizations are publishing their databases on the Internet, making data available to the larger community. Applications such as SkyQuery [37], PlasmoDB [31], and Distributed Oceanographic Data System (DODS) [14] use the published archives for comprehensive experiments that involve merging, joining, and comparing Gigabyte and Terabyte datasets.
As these data-intensive scientific applications increase in scale and number, network bandwidth constrains the performance of all applications that share a network.

We are particularly interested in the scalability and network performance of SkyQuery [27]. SkyQuery is the mediation middleware used in the World Wide Telescope (WWT), a virtual telescope for multi-spectral and temporal experiments. The WWT is an exemplar scientific database federation, supporting queries across vast amounts of freely-available, widely-distributed data [15]. The WWT faces an impending scalability crisis. With fewer than 10 sites, network performance limits responsiveness and throughput already. We expect the federation to expand to more than 120 sites in 2006.

* This work was supported in part by NSF awards IIS-0430848 and ACI-0086044, by DOE award P020685, and by the IBM Corporation.

While caching is the principal solution to scalability and performance, existing database caching solutions fail to meet the needs of scientific databases. Caching is a dangerous technology because it can reduce the parallelism and data filtering benefits of database federations. Thus, caching must be applied judiciously. A query against a federation is divided into sub-queries against member sites, which are evaluated in parallel. Parallel evaluation brings great computational resources to bear on experiments that are initiated from the weakest of computers. Caching can reduce parallelism by moving workload from many databases to few caches. Running queries at the databases also filters results [4], producing compact results from large tables. Many scientific queries operate against a large amount of data. Bringing the large data into cache and computing a small result can waste an arbitrarily large amount of network bandwidth.

The primary goal in current database caching solutions [3, 18, 22] is to maximize hit rate and minimize response time for a single application.
Minimizing network traffic is a secondary goal. Organizations have no direct motivation to reduce network traffic because they are not charged by the amount of bandwidth they consume. However, it is imperative for data-intensive applications to focus on being good “network citizens” and using shared resources conscientiously. If not, the workloads generated by these applications will make them unwelcome on public networks.

Proceedings of the 21st International Conference on Data Engineering (ICDE 2005) 1084-4627/05 $20.00 © 2005 IEEE

We propose bypass-yield caching, an altruistic caching framework for scientific database workloads. As its principal goal, it adopts network citizenship: caching data in order to minimize network traffic. Bypass-yield caching profiles workload to differentiate between data objects for which caching saves network bandwidth and those which should not be cached. The latter are routed directly to the back-end