A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching

Xiaodan Wang¹, Tanu Malik¹, Randal Burns¹, Stratos Papadomanolakis², and Anastassia Ailamaki²

¹ Johns Hopkins University, USA
{xwang, tmalik, randal}@cs.jhu.edu
² Carnegie Mellon University, USA
{stratos, natassa}@cs.cmu.edu

Abstract. Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as Biology and Astronomy. However, contention at a centralized backend database is a major performance bottleneck that limits the scalability of Internet-based database applications. Mid-tier caching reduces contention at the backend database by distributing database operations to the cache. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1.4 million query Astronomy workload demonstrate up to a 70% reduction in network traffic and up to a threefold reduction in query response time when compared with alternative units of cache replacement.

1 Introduction

The sciences are collecting and analyzing vast amounts of observational data. In Astronomy, cataloging and mapping spectral characteristics of objects in only a fraction of the sky requires several terabytes of storage. Data are made available to remote users for processing, for example through SkyQuery [1], a federation of Astronomy databases and part of the World-Wide Telescope [2]. However, SkyQuery faces an impending scalability crisis.
The federation is expected to expand from roughly a dozen members today to over a hundred in the near future [3]. Furthermore, member databases, such as the Sloan Digital Sky Survey (SDSS) [4], are accumulating data at an astonishing rate.

Mid-tier caching is an attractive solution for increasing scalability, availability, and performance of distributed database applications [5]. We study mid-tier caching in the context of SkyQuery using bypass-yield caching [6]. Bypass-yield caching replicates database objects, e.g., columns (attributes), tables, or views, at caches deployed near the clients so that queries are served locally, reducing network bandwidth requirements. Caches service some queries in cache and ship other queries to be evaluated at the backend database.

Our experience with bypass-yield caching indicates that query evaluation performance in the cache is also critical. Despite the network benefits, poor I/O

R. Kotagiri et al. (Eds.): DASFAA 2007, LNCS 4443, pp. 374–385, 2007. © Springer-Verlag Berlin Heidelberg 2007