A Parallel Algorithm to Compute Data Synopsis

Carlo DELL'AQUILA, Francesco DI TRIA, Ezio LEFONS, and Filippo TANGORRA
Dipartimento di Informatica
Università degli Studi Bari "Aldo Moro"
via Orabona 4, 70125 Bari
ITALY
{dellaquila, francescoditria, lefons, tangorra}@di.uniba.it

Abstract: - Business Intelligence systems are based on traditional OLAP, data mining, and approximate query processing. These activities make it possible to extract information and knowledge from large volumes of data and to support decision makers in the strategic choices to be taken in order to improve the business processes of the Information System. Among them, only approximate query processing addresses the issue of reducing response time, as it aims to provide fast query answers affected by a tolerable amount of error. However, this kind of processing requires the pre-computation of a synopsis of the data stored in the Data Warehouse. In this paper, a parallel algorithm for the computation of data synopses is presented.

Key-Words: - data warehouse, approximate query processing, data synopsis, parallel algorithm, message passing interface.

1 Introduction
The core of a Business Intelligence system is the Data Warehouse, a repository built to store large volumes of data. These data are obtained by integrating operational data coming from the heterogeneous data sources (such as relational databases, XML files, and Excel documents) adopted in transactional systems. Since a Data Warehouse is used in the decision-making process, it must be designed to support statistical analyses of data [1, 2]. Moreover, it must allow analyses based on temporal series. For this reason, a Data Warehouse must not only integrate data coming from operational databases but also preserve historical data, accumulating data over time.
Consequently, the cardinality of the Data Warehouse tables grows very fast, because records are inserted whenever the Data Warehouse is fed and are never deleted. The data stored in a Data Warehouse are used in On-Line Analytical Processing (OLAP) to produce information for the decision-making process. OLAP consists of a set of analytical queries, essentially based on statistical functions, and of typical OLAP operators, such as drill-down, roll-up, slice-and-dice, and pivoting [3]. In particular, the statistical functions based on the SQL aggregation and grouping operators usually require a long answer time [4, 5]. As the results of statistical computations drive strategic business choices, decision makers are often not interested in exact values: approximate values suffice. Indeed, in this context it may be preferable to obtain approximate values quickly rather than exact values that require a long answer time.
Nowadays, several systems support approximate query answering, based on different methodologies such as wavelets [6], sampling [7, 8], and graph-based modelling [9]. Regardless of the adopted methodology, all these systems share the following process model: (a) computing the data synopsis and (b) using the computed synopsis to execute analytical queries [10]. It has been widely verified that such systems produce answers in less time than traditional systems, with an acceptable percentage of error [11].
The methodology presented here is based on the analytical data profile [12]. According to this methodology, the data synopsis is a set of computed values, the so-called Canonical Coefficients (CCs), which carry information about the multivariate distribution of the data stored in the Data Warehouse. The computational time needed to obtain these coefficients is very high.
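The exact definition of the Canonical Coefficients is given in [12] and is not reproduced here; the following sketch is only an illustrative assumption of how a polynomial-based synopsis of this kind can be accumulated in a single scan of a relation. It approximates a univariate attribute distribution by the coefficients of a Legendre series, each coefficient being the average of a Legendre polynomial evaluated over the (rescaled) attribute values. All function names and the rescaling convention are hypothetical.

```python
# Illustrative sketch (not the authors' exact formulation): a polynomial
# synopsis computed in one scan. Each coefficient c_i is the average of
# the Legendre polynomial P_i over the attribute values rescaled to [-1, 1].

def legendre_values(x, degree):
    """Evaluate P_0(x)..P_degree(x) via the Bonnet recurrence
    (i+1) P_{i+1}(x) = (2i+1) x P_i(x) - i P_{i-1}(x)."""
    p = [1.0, x]
    for i in range(1, degree):
        p.append(((2 * i + 1) * x * p[i] - i * p[i - 1]) / (i + 1))
    return p[: degree + 1]

def synopsis_coefficients(data, degree, lo, hi):
    """One-pass synopsis: c_i = (1/N) * sum over tuples of P_i(t),
    with each attribute value rescaled from [lo, hi] to [-1, 1]."""
    coeffs = [0.0] * (degree + 1)
    for v in data:
        t = 2.0 * (v - lo) / (hi - lo) - 1.0   # rescale to [-1, 1]
        for i, p_i in enumerate(legendre_values(t, degree)):
            coeffs[i] += p_i
    return [c / len(data) for c in coeffs]

data = [0.1 * k for k in range(11)]            # toy attribute values in [0, 1]
ccs = synopsis_coefficients(data, 4, 0.0, 1.0)
```

Note that the coefficients are plain averages over the tuples, which is what makes a single sequential scan of the relation sufficient; it also explains why the cost grows with both the cardinality and the polynomial degree.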
The number of coefficients to be generated depends on both the degree of the approximation function (in fact, the CCs are the coefficients of the approximation polynomials) and the number of attributes involved in the computation. Moreover, the computational process needs to scan the entire relations. For these reasons, the computational time depends on (a) the degree of approximation, (b) the number of attributes, and (c) the cardinality of the relation. As the generation time of the data synopsis is a standard criterion for evaluating approximate query answering methodologies, here we investigate an extension of our analytic method by designing a parallel algorithm.

WSEAS TRANSACTIONS on INFORMATION SCIENCE and APPLICATIONS, Carlo Dell'Aquila, Francesco Di Tria, Ezio Lefons, Filippo Tangorra. ISSN: 1790-0832, p. 691, Issue 5, Volume 7, May 2010.
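Because each coefficient of the synopsis is a sum over the tuples of the relation, the computation is naturally data-parallel: the relation can be partitioned among processes, each process accumulates partial sums over its own partition, and the partial results are merged by a global reduction (in MPI terms, an MPI_Reduce followed by a normalization). The sketch below is hypothetical and runs the workers sequentially for clarity; it uses a simplified power-basis stand-in for the real coefficients, and only shows that merging partial sums reproduces the single-scan result.

```python
# Hypothetical sketch of the data-parallel structure: each worker scans
# only its partition and accumulates partial sums; a reduction merges them.
# In an MPI implementation each partition would live on a separate process
# and the merge would be an MPI_Reduce; here workers are simulated in-process.

def partial_sums(partition, degree):
    """Per-worker step: accumulate sum of t**i for i = 0..degree
    (a simplified stand-in for the real coefficient contributions)."""
    sums = [0.0] * (degree + 1)
    for t in partition:
        for i in range(degree + 1):
            sums[i] += t ** i
    return sums, len(partition)

def reduce_sums(results):
    """Reduction step: element-wise sum of the workers' partial sums,
    then normalization by the total cardinality."""
    width = len(results[0][0])
    total, n = [0.0] * width, 0
    for sums, count in results:
        n += count
        for i in range(width):
            total[i] += sums[i]
    return [s / n for s in total]

data = [-1.0, -0.5, 0.0, 0.5, 1.0]
partitions = [data[:2], data[2:4], data[4:]]       # 3 simulated workers
parallel = reduce_sums([partial_sums(p, 3) for p in partitions])
serial = reduce_sums([partial_sums(data, 3)])      # single-scan baseline
```

Since the per-tuple contributions are combined only by addition, the reduction is associative, so the relation can be split into any number of partitions without changing the result; this is the property a message-passing implementation would rely on.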