Bypass Caching: Making Scientific Databases Good Network Citizens*
Tanu Malik, Randal Burns
Department of Computer Science
Johns Hopkins University
Baltimore, MD 21218
{tmalik, randal}@cs.jhu.edu

Amitabh Chaudhary
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
Amitabh.Chaudhary.1@nd.edu
Abstract
Scientific database federations are geographically dis-
tributed and network bound. Thus, they could benefit
from proxy caching. However, existing caching tech-
niques are not suitable for their workloads, which compare
and join large data sets. Existing techniques reduce paral-
lelism by conducting distributed queries in a single cache
and lose the data reduction benefits of performing selec-
tions at each database. We develop the bypass-yield for-
mulation of caching, which reduces network traffic in
wide-area database federations, while preserving paral-
lelism and data reduction. Bypass-yield caching is altruis-
tic; caches minimize the overall network traffic generated
by the federation, rather than focusing on local perfor-
mance. We present an adaptive, workload-driven algo-
rithm for managing a bypass-yield cache. We also develop
on-line algorithms that make no assumptions about work-
load: a k-competitive deterministic algorithm and a ran-
domized algorithm with minimal space complexity. We
verify the efficacy of bypass-yield caching by running work-
load traces collected from the Sloan Digital Sky Survey
through a prototype implementation.
1. Introduction
An increasing number of science organizations are pub-
lishing their databases on the Internet, making data avail-
able to the larger community. Applications such as Sky-
Query [37], PlasmoDB [31], and Distributed Oceano-
graphic Data System (DODS) [14] use the published
archives for comprehensive experiments that involve merg-
ing, joining, and comparing Gigabyte and Terabyte
datasets. As these data-intensive scientific applications in-
crease in scale and number, network bandwidth constrains
the performance of all applications that share a net-
work.
We are particularly interested in the scalability and
network performance of SkyQuery [27]. SkyQuery is the
mediation middleware used in the World Wide Telescope
(WWT), a virtual telescope for multi-spectral and temporal
experiments. The WWT is an exemplar scientific database
federation, supporting queries across vast amounts of
freely-available, widely-distributed data [15]. The WWT
faces an impending scalability crisis. Even with fewer than
10 sites, network performance already limits responsiveness
and throughput. We expect the federation to expand to more
than 120 sites in 2006.
* This work was supported in part by NSF awards IIS-0430848 and
ACI-0086044, by DOE award P020685, and by the IBM Corporation.
While caching is the principal solution to scalability
and performance, existing database caching solutions fail
to meet the needs of scientific databases. Caching is a dan-
gerous technology because it can reduce the parallelism and
data filtering benefits of database federations. Thus, caching
must be applied judiciously. A query against a federation is
divided into sub-queries against member sites, which are
evaluated in parallel. Parallel evaluation brings great com-
putational resources to bear on experiments that are initiated
from the weakest of computers. Caching can reduce par-
allelism by moving workload from many databases to few
caches. Running queries at the databases also filters results
[4], producing compact results from large tables. Many
scientific queries operate against large amounts of data.
Bringing a large table into the cache in order to compute a
small result can waste an arbitrarily large amount of
network bandwidth.
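The bandwidth argument above can be illustrated with a small calculation. The following is a minimal sketch under simplifying assumptions, not the algorithm developed in this paper: it assumes that a cached table is transferred over the network exactly once, that every later query on it is served locally, and that a bypassed query transfers only its result. The function names and the table and result sizes are hypothetical.

```python
# Illustrative sizes only; these are assumptions, not measurements.
GB = 1024 ** 3
MB = 1024 ** 2


def bypass_traffic(result_sizes):
    """Network bytes when every query runs at the database and only
    the filtered results cross the network."""
    return sum(result_sizes)


def cache_traffic(table_size):
    """Network bytes when the whole table is loaded into the cache once
    and all queries are then answered locally (assumed simplification)."""
    return table_size


def caching_saves_traffic(table_size, result_sizes):
    """True when loading the table costs less network traffic than
    shipping every query result from the database."""
    return cache_traffic(table_size) < bypass_traffic(result_sizes)


table = 10 * GB                   # hypothetical large table
few_queries = [1 * MB] * 3        # three queries with small results
many_queries = [1 * MB] * 20_000  # a heavily queried table

print(caching_saves_traffic(table, few_queries))   # False: caching wastes bandwidth
print(caching_saves_traffic(table, many_queries))  # True: caching pays off
```

Under these assumptions, loading a 10 GB table to answer three queries that return 1 MB each moves thousands of times more data than evaluating the queries at the database. Caching reduces traffic only once the accumulated result sizes exceed the size of the table.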
The primary goal in current database caching solutions
[3, 18, 22] is to maximize hit rate and minimize response
time for a single application. Minimizing network traffic is
a secondary goal. Organizations have no direct motivation
to reduce network traffic because they are not charged by
the amount of bandwidth they consume. However, it is im-
perative for data-intensive applications to focus on being
good “network citizens” and using shared resources consci-
entiously. If not, the workloads generated by these applica-
tions will make them unwelcome on public networks.
We propose bypass-yield caching, an altruistic caching
framework for scientific database workloads. As its princi-
pal goal, it adopts network citizenship: caching data in order
to minimize network traffic. Bypass-yield caching profiles
workload to differentiate between data objects for which
caching saves network bandwidth and those which should
not be cached. The latter are routed directly to the back-end
Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)
1084-4627/05 $20.00 © 2005 IEEE