A Quality-Centric Data Model for Distributed Stream Management Systems

Marco Fiscato
Department of Computing, Imperial College London
London, United Kingdom
mfiscato@doc.ic.ac.uk

Quang Hieu Vu
Department of Computing, Imperial College London
London, United Kingdom
qhvu@doc.ic.ac.uk

Peter Pietzuch
Department of Computing, Imperial College London
London, United Kingdom
prp@doc.ic.ac.uk

ABSTRACT

It is challenging for large-scale stream management systems to always return perfect results when processing data streams originating from distributed sources. Data sources and intermediate processing nodes may fail during the lifetime of a stream query. In addition, individual nodes may become overloaded due to processing demands. In practice, users have to accept incomplete or inaccurate query results because of failure or overload. In such cases, stream processing systems would benefit from knowing the impact of imperfect processing on data quality when making decisions about query optimisation and fault recovery. In addition, users would want to know how much the result quality was degraded.

In this paper, we propose a quality-centric relational stream data model that can be used together with existing query processing methods over distributed data streams. Besides giving useful feedback about the quality of tuples to users, the model provides the distributed stream management system with information on how to optimise query processing and enhance fault tolerance. We demonstrate how our data model can be applied to an existing distributed stream management system. Our evaluation shows that it enables quality-aware load-shedding, while introducing only a small per-tuple overhead.

1. INTRODUCTION

Today’s distributed stream management systems (DSMSs) must support a class of applications that process continuous queries over a geographically-distributed set of data stream sources. Applications in many domains fall under this pattern.
In healthcare, a DSMS may monitor behaviours of patients and elderly citizens across a metro area and signal emergency attention in real-time when necessary [22]. In supply chain management, DSMSs may supervise manufacturing chains to detect shipping delays before they affect production [12]. In an urban-scale sensing infrastructure [26], such systems may collect and analyse weather data to generate real-time notifications about air pollution levels and severe weather conditions affecting road users. More detail on a variety of sensor network applications can be found in [34].

Permission to make digital or hard copies of all or part of this work for personal, academic, or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. International Workshop on Quality in Databases (QDB), August 24, 2009, Lyon, France. Copyright 2009 Universität Tübingen and University of Rennes 1.

Previous research on DSMSs focused primarily on high-volume financial data processing in single data centres [1, 4]. Such applications require perfect [5] and highly-available [16] data processing. They benefit from resource over-provisioning in terms of computational nodes and high-speed networks to cope with failure and workload peaks. In contrast, the large-scale DSMSs described above face a more hostile environment. Data sources are widely distributed and therefore only interconnected through unreliable wide-area network links. A set of heterogeneous processing nodes may be spread around the infrastructure at various locations, with different failure behaviour and under different administrative control [27]. In such a deployment environment, stream processing failures occur frequently due to faulty hardware, software bugs, overloaded nodes and network faults or partitions.
The DSMS may not have sufficient resources to recover all lost processing after failure.

While users in many domains can accept incomplete query results, they want to know about the degree of quality degradation due to imperfect processing. Answering this question is a challenge for existing query processing methods since they aim for perfect processing, masking the effects of failure through redundant processing or re-processing of missed tuples [15]. In these systems, imperfect stream processing usually indicates a catastrophic failure of the system.

To address this problem, we propose a new quality-centric stream data model. In this model, streams have associated meta-data about weight, recall and utility that estimate how imperfect processing has affected the quality of tuples in a stream. This model is independent of specific query semantics and can be used with existing relational DSMSs to provide continuous feedback to users on the achieved processing level. It also enables the DSMS to identify important data streams and, for example, replicate them in advance to mask future failure or optimise query processing under resource shortage by dropping the least important data tuples first.

The paper makes the following three main contributions: (1) We describe a quality-centric data model that provides users with useful information about incompleteness of query results due to failures. The data model is also designed to give the system feedback on the importance of data streams when making resource allocation decisions. (2) We demonstrate how to apply our data model to estimate the correctness of query results, optimise query processing and replicate important data streams to mask failure. (3) We present our implementation of the data model as part of Borealis [1], an existing state-of-the-art DSMS, and evaluate its performance.

The rest of the paper is organised as follows. In Section 2, we discuss related work.
In Section 3, we state our assumptions about the DSMS. We describe the quality-centric data model in Section 4. Three use cases for the data model are introduced in Section 5. In Section 6, we describe how to implement the data model over an existing DSMS. Our experimental evaluation that highlights bene-