Probabilistic Model for Accuracy Estimation in Approximate Monodimensional Analyses

CARLO DELL’AQUILA, FRANCESCO DI TRIA, EZIO LEFONS, FILIPPO TANGORRA
Dipartimento di Informatica
Università degli Studi di Bari Aldo Moro
Via Orabona 4, 70125 Bari
ITALY
{dellaquila, lefons, tangorra, francescoditria}@di.uniba.it
http://www.di.uniba.it/~dblab

Abstract: - Approximate query processing is often based on analytical methodologies that provide fast responses to queries. In exchange, the approximate answers are affected by a small amount of error. These techniques are now being exploited in data warehousing environments, because the queries that extract information involve high-cardinality relations and therefore require long computation times. Approximate answers are profitably used in the decision-making process, where total precision is not needed. It is therefore important to provide decision makers with accuracy estimates of the approximate answers, that is, a measure of how reliable an approximate answer is. Here, a probabilistic model is presented that provides such an accuracy measure when the analytical methodology used for decisional analyses is based on polynomial approximation. This probabilistic model is a Bayesian network able to estimate the relative error of the approximate answers.

Key-Words: Analytic query processing, Approximate query answer, Polynomial approximation, Accuracy estimation, Probabilistic model.

1 Introduction
In data warehousing (DW) environments, On-Line Analytical Processing (OLAP) is devoted to extracting information to be used in decision-making activities. DW databases commonly store high volumes of data and, therefore, even a simple scan for data aggregation requires answer times that may range from minutes to hours [1, 2]. Moreover, in a distributed system, access to remote data can sometimes be impracticable (for example, because of a server or communication line crash).
Traditional query processing engines always provide decision makers with exact query answers. However, in many cases, end users do not require total precision. Indeed, there are several scenarios in OLAP where it is not mandatory to obtain exact query answers. As an example, in the drill-down task, the preliminary queries are used only to determine the most important facets to consider in decision making. In fact, analytical processing always has an unpredictable and exploratory nature. As a further example, in numerical answers that require the computation of the average of a large set of data, total precision is not needed and an approximate value will suffice. Finally, in Database Management Systems, query optimizers must define a plan for physical data access on the basis of selectivity estimates [3].

These issues have led to the definition of methodologies able to provide approximate query answers [4], especially in data warehousing environments [5], such that end users get fast responses, although affected by a small amount of error. In fact, this research topic is based on the assumption that, if the query is very time-consuming and the error in the approximation is negligible, then it is preferable to have an approximate answer in a short time rather than the exact answer after a long wait.

What is an approximate answer? For a query that involves data aggregation, the answer is a scalar value and the corresponding approximate answer is an estimate of this value. The most popular methodologies that represent the theoretical bases for approximate query processing are based on sampling [6], histograms [7], wavelets [8, 9], probabilistic models [10, 11, 12], distributed processing [13], clustering [14], orthonormal series approximation [15], genetic programming [16], and graph-based models [17]. Most of them need to perform a reduction of the data stored in database relations.
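To make the notion of an approximate answer to an aggregate query concrete, the following minimal sketch estimates an average from a uniform random sample and measures its relative error against the exact value. It uses sampling, one of the methodologies listed above, and not the polynomial approximation adopted in this paper. The data are synthetic, and the names `approximate_average`, `relative_error`, and `sales` are hypothetical, introduced only for illustration.

```python
import random
import statistics


def approximate_average(values, sample_size, seed=0):
    """Estimate the mean of `values` from a uniform random sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    return statistics.fmean(sample)


def relative_error(exact, approximate):
    """Relative error of an approximate scalar answer (exact must be non-zero)."""
    return abs(exact - approximate) / abs(exact)


# Hypothetical attribute of a large relation: 100,000 synthetic sale amounts.
data_rng = random.Random(42)
sales = [data_rng.uniform(10.0, 500.0) for _ in range(100_000)]

exact_avg = statistics.fmean(sales)                          # full scan
approx_avg = approximate_average(sales, sample_size=1_000)   # reads 1% of the data
error = relative_error(exact_avg, approx_avg)

print(f"exact={exact_avg:.2f} approx={approx_avg:.2f} relative error={error:.4f}")
```

In this sketch the relative error can be computed only because the exact answer is also available. In practice the exact answer is precisely what one wants to avoid computing, which is why an estimate of the error is needed.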
In fact, these methodologies provide approximate answers using small, pre-computed data synopses obtained by compressing the original data [18].

WSEAS TRANSACTIONS on COMPUTERS Carlo Dell'Aquila, Francesco Di Tria, Ezio Lefons, Filippo Tangorra ISSN: 1109-2750 1075 Issue 10, Volume 9, October 2010

According to [19], the criteria for comparing the methodologies for approximate query processing