MCDB-R: Risk Analysis in the Database Subi Arumugam 1 Ravi Jampani 1 Luis L. Perez 2 Fei Xu 3,∗ Christopher Jermaine 2 Peter J. Haas 4 1 University of Florida 2 Rice University 3 Microsoft Corporation 4 IBM Research - Almaden Gainesville, FL, USA Houston, TX, USA Redmond, WA, USA San Jose, CA, USA {sa2,jampani}@cise.ufl.edu {lp6,cmj4}@rice.edu feixu@microsoft.com phaas@us.ibm.com ABSTRACT Enterprises often need to assess and manage the risk arising from uncertainty in their data. Such uncertainty is typically modeled as a probability distribution over the uncertain data values, specified by means of a complex (often predictive) stochastic model. The probability distribution over data values leads to a probability dis- tribution over database query results, and risk assessment amounts to exploration of the upper or lower tail of a query-result distribu- tion. In this paper, we extend the Monte Carlo Database System to efficiently obtain a set of samples from the tail of a query-result distribution by adapting recent “Gibbs cloning” ideas from the sim- ulation literature to a database setting. 1. INTRODUCTION In the face of regulatory processes such as Basel II and Sol- vency 2, enterprises are becoming increasingly concerned with man- aging and assessing the credit, financial, engineering, and opera- tional risk arising from uncertain data [16]. Examples of uncertain data include future values of financial assets, customer order quan- tities under hypothetical price changes, and transportation times for future shipments under alternative shipping schemes. Data uncertainty is usually modeled as a probability distribution over possible data values, and such probabilities are often specified via complex stochastic models. E.g., we might specify the forego- ing uncertain financial, retail, and logistics data sets using Euler ap- proximations to stochastic differential equations, Bayesian demand models, and stochastic network models, respectively. This spec- ification in turn leads to the representation of data uncertainty as a probability distribution over possible database instances (“possi- ble DBs,” also called “possible worlds”). Running an aggregation query such as “select sum of sales” over uncertain data therefore does not yield a single, deterministic result for total sales; rather, there is a probability distribution over all possible query results. In this setting, risk assessment typically corresponds to comput- ing interesting properties of the upper or lower tails of the query- result distribution; for example, computing the probability of a ∗ Fei Xu performed this work while at U. Florida. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were presented at The 36th International Conference on Very Large Data Bases, September 13-17, 2010, Singapore. Proceedings of the VLDB Endowment, Vol. 3, No. 1 Copyright 2010 VLDB Endowment 2150-8097/10/09... $ 10.00. large investment loss. The problem is made more complex by the fact that the tails of the distribution are hard to specify a priori, and so risk analysis frequently focuses on the inverse problem of deter- mining an extreme quantile, e.g., determining the value γ such that there is a 0.1% probability of seeing a loss of γ or more. Such a “value at risk” can be viewed as a probabilistic worst-case scenario, and might be used to specify precisely where the “upper tail” of the loss distribution starts. Proceeding further, one might be interested in computing a more sophisticated “coherent” risk measure such as “expected shortfall” [16], which in this example is defined as the expected total loss, given that this loss exceeds γ. More generally, the entire conditional distribution of the loss—given that the loss exceeds γ—might be of interest. MCDB In prior work, the Monte Carlo Database System (MCDB) of Jampani et al. [13] was designed for flexible exploration of query- result distributions under arbitrary SQL queries and a broad range of complex, user-defined stochastic models. MCDB uses (pos- sibly user-defined) “variable generation” (VG) functions to pseu- dorandomly generate instances of each uncertain data value in a database, yielding a sample from the possible-DB distribution. Re- peating this process multiple times (i.e., executing multiple “Monte Carlo repetitions”) generates a set of independent and identically distributed (i.i.d.) samples from this distribution. Given an SQL aggregation query of interest, MCDB executes the query on each sampled DB instance, thereby generating i.i.d. samples from the query-result distribution. MCDB uses Monte Carlo techniques to estimate interesting features of the query-result distribution—the expected value, variance, and quantiles of the query answer—along with probabilistic error bounds on the estimates. Importantly, a VG function takes as input one or more parameter tables (ordinary re- lations) that control the function’s behavior, and produces as output a table containing one or more correlated data values. For n Monte Carlo repetitions, MCDB does not actually mate- rialize an uncertain database n times; the costs for such a naive approach would be exorbitant. Instead, MCDB executes a query plan only once over a set of tuple bundles rather than over or- dinary tuples—no matter how many Monte Carlo repetitions are required—often leading to significant time savings. A tuple bundle encapsulates the instantiations of a tuple over a set of generated DB instances, and carries along the pseudorandom number seeds used by the VG functions to instantiate the uncertain data values. Limitations of MCDB for Risk Analysis Unfortunately, naive Monte Carlo, as implemented in the original MCDB prototype, is not the best tool for exploring the tails of a query-result distri- bution. Consider the query SELECT SUM(loss) AS total- Loss FROM t, where t.loss is an uncertain attribute, perhaps representing a future financial loss. Suppose that the query-result