Access Methods for Markovian Streams

University of Washington Technical Report UW TR: #TR08-07-01

Julie Letchner #1, Christopher Ré #2, Magdalena Balazinska #3, Matthai Philipose ∗4

# Computer Science & Engineering Department, University of Washington, Seattle, Washington, USA
1 letchner, 2 chrisre, 3 magda@cs.washington.edu

∗ Intel Research Seattle, Seattle, Washington, USA
4 matthai.philipose@intel.com

Abstract

Model-based views have recently been proposed as an effective method for querying noisy sensor data. Commonly used models from the AI literature (e.g., the Hidden Markov Model) expose to applications a stream of probabilistic and correlated state estimates computed from the sensor data. Many applications want to detect sophisticated patterns of states from these Markovian streams. Such queries are called event queries. In this paper, we present a new system, Caldera, for processing event queries over stored Markovian streams. At the heart of our system is a set of access methods for Markovian streams that can improve event query performance by orders of magnitude compared to existing techniques, which must scan the entire stream. These access methods use new adaptations of traditional B+ tree indexes, and a new index, called the Markov-chain index, to efficiently extract only those parts of the stream potentially relevant to the query while retaining the stream’s Markovian properties. We have implemented our prototype system on BDB and demonstrate its effectiveness on both synthetic data and real data from a building-wide RFID deployment.

1 Introduction

Applications that make critical decisions based on sensor data are increasingly common, with sensor deployments now playing integral roles in supply chain automation [5, 39], environment monitoring [17], elder-care [25, 28], etc.
Unfortunately, building applications on top of raw sensor data remains challenging because sensors produce inaccurate information, frequently fail, and can rarely collect data on an entire region of interest. As an example, consider a Radio Frequency IDentification (RFID) tracking application [38] in which RFID readers are distributed throughout an environment. Ideally, when a tag (carried by a person or attached to an object) passes in the vicinity of a reader, the reader detects and logs the tag’s presence: e.g., Bob’s tag was sighted by reader A at time 7, reader B at time 8, etc. In practice, however, readers often fail to detect nearby tags [40], forcing applications to deal with sparse and noisy input streams.

The reduction of errors and gaps in sensor data streams is the focus of a large body of techniques developed in the AI community [34]. While a limited number of these techniques can be applied in real time, the most effective (Bayesian smoothing [13]) can be applied only as a post-processing step, after the raw data stream is stored on disk. Our goal is to support archive-based applications that leverage this smoothed data in order to provide the most accurate possible answers to historical queries (e.g., “Was Bob in his office yesterday?”, “Did Margot take her medication before breakfast every day last month?”, etc.).

The result of any smoothing technique is a probabilistic stream, in which each timestep encodes not a single state, but a distribution over possible states. In the RFID tracking example, such a stream might indicate, for each timestep, the distribution over possible locations of a tag: e.g., at time 7, Bob was in the hallway with probability 0.8 and in his office with probability 0.2. Additionally, states at consecutive timesteps can be correlated: e.g., Bob’s location at time 8 is correlated with