Event-based lossy compression for effective and efficient OLAP over data streams

Alfredo Cuzzocrea a,*, Sharma Chakravarthy b

a ICAR-CNR and University of Calabria, Rende, Cosenza 87036, Italy
b The University of Texas at Arlington, Arlington, TX 76019-0015, USA

* Corresponding author.
E-mail addresses: cuzzocrea@si.deis.unical.it (A. Cuzzocrea), sharma@cse.uta.edu (S. Chakravarthy).
URLs: http://si.deis.unical.it/~cuzzocrea (A. Cuzzocrea), http://itlab.uta.edu/sharma/ (S. Chakravarthy).

Data & Knowledge Engineering 69 (2010) 678–708
doi:10.1016/j.datak.2010.02.006

Article info

Article history: Available online 20 February 2010

Keywords: Data stream query processing; Data stream compression methodologies and techniques; Knowledge discovery from data streams; OLAP over data streams; Event-based data stream processing; Event-based data stream compression

Abstract

An innovative event-based lossy compression model for effective and efficient OLAP over data streams, called ECM-DS, is presented and experimentally assessed in this paper. The main novelty of our compression approach with respect to traditional data stream compression techniques lies in exploiting the semantics of the reference application scenario in order to drive the compression process by means of the "degree of interestingness" of events occurring in the target stream. This improves the quality of the approximate answers retrieved for OLAP queries over data streams and, in turn, the quality of complex knowledge discovery tasks over data streams developed on top of ECM-DS and implemented via ad-hoc data stream mining algorithms. Overall, the compression strategy we propose in this research lays the basis for a novel class of intelligent applications over data streams, where the knowledge on actual streams is integrated with and correlated to the knowledge related to expired events that are considered critical for the target OLAP analysis scenario. Finally, a comprehensive experimental evaluation over several classes of data stream sets confirms the benefits deriving from the event-based data stream compression approach proposed in ECM-DS.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Preliminaries

The problem of efficiently representing [5] and mining [37] data streams is of considerable interest for both the Database and Data Mining research communities. Data stream query processing poses novel and previously unrecognized research challenges that make traditional DBMS technology (e.g., RDBMS) inadequate for dealing with the unbounded nature of data streams. In fact, while information stored in relational databases is represented by means of highly detailed, fine-grained tuples, and database query processing algorithms are accordingly multi-step, a data stream cannot be represented in great detail because it is potentially infinite. As a consequence, data stream query processing algorithms typically run within the context of specialized bounded time windows [1], which collect sets of stream readings (e.g., the last T readings, with T > 0), under the constraint that they are applied in one pass only. Thus, it is not feasible in practice to devise reliable multi-step data stream query processing algorithms inspired by conventional database query processing algorithms.

On the other hand, data stream mining algorithms aim at discovering interesting knowledge from multi-rate, rapidly evolving streams by means of clustering [53], classification (e.g., [4]), decision trees (e.g., [32]), OLAM (e.g., [25]), and so forth.
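
As a minimal illustration of the bounded-window, one-pass setting recalled above (and not of the ECM-DS model itself), the following Python sketch maintains an aggregate over the last T readings of a stream. The function name `windowed_averages` and the choice of the average as the aggregate are assumptions made only for this example; they do not come from the paper.

```python
from collections import deque
from typing import Iterable, Iterator


def windowed_averages(readings: Iterable[float], T: int) -> Iterator[float]:
    """Yield the average of the last T readings after each arrival.

    Hypothetical helper: each reading is inspected exactly once (one pass),
    and memory stays bounded by T however long the stream runs.
    """
    if T <= 0:
        raise ValueError("T must be positive")
    window = deque()      # holds at most T readings
    running_sum = 0.0
    for value in readings:
        window.append(value)
        running_sum += value
        if len(window) > T:
            # Expire the oldest reading; it can never be revisited.
            running_sum -= window.popleft()
        yield running_sum / len(window)


# Usage: a finite list stands in for a potentially unbounded stream.
print(list(windowed_averages([2, 4, 6, 8], T=2)))  # [2.0, 3.0, 5.0, 7.0]
```

Because expired readings are discarded as soon as they leave the window, any computation that would need to revisit them, as a multi-step database algorithm would, is ruled out in this setting.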