Efficient Episode Mining of Dynamic Event Streams

Debprakash Patnaik*, Srivatsan Laxman†, Badrish Chandramouli‡ and Naren Ramakrishnan§
*Amazon.com, Seattle, WA 98109, USA; E-mail: patnaikd@amazon.com
†Microsoft Research, Bangalore, India 560080; E-mail: slaxman@microsoft.com
‡Microsoft Research, Redmond, WA 98052; E-mail: badrishc@microsoft.com
§Department of Computer Science, Virginia Tech, Blacksburg, VA 24061; E-mail: naren@vt.edu

Abstract—Discovering frequent episodes over event sequences is an important data mining problem. Existing methods typically require multiple passes over the data, rendering them unsuitable for streaming contexts. We present the first streaming algorithm for mining frequent episodes over a window of recent events in the stream. We derive approximation guarantees for our algorithm in terms of: (i) the separation of frequent episodes from infrequent ones, and (ii) the rate of change of stream characteristics. Our parameterization of the problem provides a new sweet spot in the tradeoff between making distributional assumptions over the stream and algorithmic efficiencies of mining. We illustrate how this yields significant benefits when mining practical streams from neuroscience and telecommunications logs.

Index Terms—Event Sequences; Data Streams; Frequent Episodes; Pattern Discovery; Streaming Algorithms; Approximation Algorithms

I. INTRODUCTION

Application contexts in telecommunications, neuroscience, and intelligence analysis feature massive data streams [1] with 'firehose'-like rates of arrival. In many cases, we need to analyze such streams at speeds comparable to their generation rate. In neuroscience, one goal is to track spike trains from multi-electrode arrays [2] with a view to identifying cascading circuits of neuronal firing patterns. In telecommunications, network traffic and call logs must be analyzed on a continual basis to detect attacks or other malicious activity.
The common theme in all these scenarios is the need to mine episodes (i.e., a succession of events occurring frequently, but not necessarily consecutively [3]) from dynamic and evolving streams.

Algorithms for pattern mining over streams have become increasingly popular in recent years [4]–[7]. Manku and Motwani [4] introduced a lossy counting algorithm for approximate frequency counting over streams, with no assumptions on the stream. Their focus on a worst-case setting often leads to stringent threshold requirements. At the other extreme, algorithms such as [5] provide significant efficiencies in mining but make strong assumptions, such as an i.i.d. distribution of symbols in a stream.

In the course of analyzing some real-world datasets, we were motivated to develop new methods because existing methods cannot process streams at the desired rate and with the desired quality guarantees (see Sec. VI for some examples). Furthermore, established stream mining algorithms are almost entirely focused on itemset mining (and, modulo a few isolated exceptions, just the counting phase of it), whereas we are interested in mining general episodes.

Our specific contributions are as follows:
• We present the first algorithm for mining episodes in a stream. Unlike prior streaming algorithms that focus almost exclusively on counting, we provide solutions for both candidate generation and counting over a stream.
• Devoid of any statistical assumptions on the stream (e.g., independence or otherwise), we develop a novel error characterization for streaming episodes by identifying and tracking two key properties of the stream, viz. maximum rate of change and top-k separation. We demonstrate how the use of these two properties enables novel algorithmic optimizations, such as the idea of borders to amortize work as the stream is tracked.
• Although our work is geared towards episode mining, we adopt a black-box model of an episode mining algorithm.
In other words, our approach can encapsulate and wrap around any pattern discovery algorithm to enable it to accommodate streaming data. This significantly generalizes the scope and applicability of our approach as a general methodology to streamify existing pattern discovery algorithms.
• We demonstrate successful applications in neuroscience and telecommunications log analysis, and illustrate significant benefits in runtime, memory usage, and the scales of data that can be mined. We compare against episode-mining adaptations of two typical algorithms [5] from streaming itemsets literature.

II. PRELIMINARIES

In the framework of frequent episodes [3], an event sequence is denoted as $\langle (e_1, \tau_1), \ldots, (e_n, \tau_n) \rangle$, where $(e_i, \tau_i)$ represents the $i$th event; $e_i$ is drawn from a finite alphabet $E$ of symbols (called event-types) and $\tau_i$ denotes the time-stamp of the $i$th event, with $\tau_{i+1} \geq \tau_i$, $i = 1, \ldots, (n-1)$.

An $\ell$-node episode $\alpha$ is defined by a triple $\alpha = (V_\alpha, <_\alpha, g_\alpha)$, where $V_\alpha = \{v_1, \ldots, v_\ell\}$ is a collection of $\ell$ nodes, $<_\alpha$ is a partial order over $V_\alpha$ and $g_\alpha : V_\alpha \to E$ is a map that assigns an event-type $g_\alpha(v)$ to each node $v \in V_\alpha$.

An occurrence of an episode $\alpha$ is a map $h : V_\alpha \to \{1, \ldots, n\}$ such that $e_{h(v)} = g_\alpha(v)$ for all $v \in V_\alpha$ and for all pairs of nodes $v, v' \in V_\alpha$ such that $v <_\alpha v'$ the map $h$ ensures that $\tau_{h(v)} < \tau_{h(v')}$. Two occurrences of an episode are non-overlapped [8] if no event corresponding to one appears in-between the events corresponding to the other. The maximum number of non-overlapped occurrences of an episode is defined as its frequency in the event sequence.
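
To make the frequency definition concrete, the following is a minimal sketch of non-overlapped counting, restricted to the special case of serial episodes (where $<_\alpha$ is a total order). It is an illustration under stated assumptions, not the counting procedure used in this paper. The function name `count_non_overlapped`, the representation of events as `(event_type, timestamp)` pairs, and the representation of an episode as an ordered list of event-types are all hypothetical choices made for this example. The sketch uses a single greedy left-to-right scan that completes each occurrence as early as possible and then restarts.

```python
from typing import Hashable, Iterable, Sequence, Tuple

Event = Tuple[Hashable, float]  # (event_type, timestamp); assumed representation


def count_non_overlapped(events: Iterable[Event],
                         episode: Sequence[Hashable]) -> int:
    """Count non-overlapped occurrences of a serial episode.

    `events` must be sorted by non-decreasing timestamp.
    `episode` lists event-types in the order they must occur.
    General partial-order episodes are NOT handled by this sketch.
    """
    if not episode:
        return 0

    count = 0          # completed occurrences so far
    state = 0          # index of the next episode node to match
    last_time = None   # timestamp of the most recently matched node

    for event_type, timestamp in events:
        if event_type != episode[state]:
            continue
        # The definition requires strictly increasing timestamps
        # between consecutive nodes of an occurrence.
        if state > 0 and timestamp <= last_time:
            continue
        state += 1
        last_time = timestamp
        if state == len(episode):
            # Occurrence complete: count it and start afresh, so the
            # next occurrence begins after this one ends.
            count += 1
            state = 0

    return count


# Hypothetical toy sequence, used only for illustration.
toy_events = [("A", 1), ("C", 2), ("B", 3), ("A", 4),
              ("A", 5), ("B", 6), ("B", 7)]
print(count_non_overlapped(toy_events, ["A", "B"]))  # 2
```

In the toy sequence, the episode A → B is completed at time 3 (using the A at time 1) and again at time 6 (using the A at time 4); the final B at time 7 has no unused preceding A, so the count is 2.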