Finding Relevant Patterns in Bursty Sequences

Alexander Lachmann∗
RWTH Aachen, Germany
alexander.lachmann@rwth-aachen.de

Mirek Riedewald
Cornell University
Ithaca, New York
mirek@cs.cornell.edu

ABSTRACT
Sequence data is ubiquitous and finding frequent sequences in a large database is one of the most common problems when analyzing sequence data. Unfortunately many sources of sequence data, e.g., sensor networks for data-driven science, RFID-based supply chain monitoring, and computing system monitoring infrastructure, produce a challenging workload for sequence mining. It is common to find bursts of events of the same type. Such bursts result in high mining cost, because input sequences are longer. An even greater challenge is that these bursts tend to produce an overwhelming number of irrelevant repetitive sequence patterns with high support. Simply raising the support threshold is not a solution, because at some point interesting sequences will get eliminated. As an alternative we propose a novel transformation of the input sequences. We show that this transformation has several desirable properties. First, the transformed data can still be mined with existing sequence mining algorithms. Second, for a given support threshold the mining result can often be obtained much faster and it is usually much smaller and easier to interpret. Third, and most importantly, we show that the result sequences retain the important characteristics of the sequences that would have been found in the original (not transformed) data. We validate our technique with an experimental study using synthetic and real data.

Keywords
Frequent sequence mining, data transformation, event stream, bursts, temporal data mining

1. INTRODUCTION
Sequence data is ubiquitous and mining this data to find patterns is a challenging problem for many applications [12]. In this paper we focus on the important problem of finding frequent subsequences in a set of given input sequences.
∗ Work done while visiting Cornell University.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

Traditionally, frequent sequence mining is used to discover purchase patterns in sales transaction data. Consider a customer's purchase history like {chocolate, chips, water} → cheese → {broccoli, carrots}. Here the set step {} indicates that products were purchased in the same transaction, while the sequence step → indicates that the products on the left were purchased earlier than those on the right. A subsequence like {chocolate, chips} → {broccoli, carrots} could indicate that after indulging in sweets and snacks, the customer feels guilty and purchases healthy vegetables.¹ If many customer sequences contain this subsequence, stores can take advantage of such patterns for targeted advertisement or promotions.

Discovery of common patterns of page visits in Web logs can help in improving the design of a Web site or in deciding what advertisements to present to Web surfers. By finding common sequences of hardware- or software-related events (errors, warnings, status events) that lead to critical system failures, system administrators can take active measures for re-configuring/re-designing systems or for preventive maintenance.
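To make the purchase example concrete, the following minimal Python sketch checks whether a pattern is a subsequence of a purchase history and counts the fraction of input sequences that contain it (its support). It is an illustration only, not the mining algorithm studied in this paper. The function names `contains` and `support`, the representation of a sequence as a list of sets, and the two additional customers in `database` are assumptions made for this example.

```python
from typing import AbstractSet, Sequence

# Assumed representation: a sequence is a list of transactions,
# and each transaction is a set of item names.
Transaction = AbstractSet[str]


def contains(sequence: Sequence[Transaction], pattern: Sequence[Transaction]) -> bool:
    """Return True if `pattern` is a subsequence of `sequence`.

    Each pattern step must be a subset of some transaction, the matched
    transactions must appear in the same order as the pattern steps, and
    transactions in between may be skipped. Greedily matching each step
    at the earliest possible transaction is sufficient for this check.
    """
    pos = 0
    for step in pattern:
        # Skip transactions that do not include all items of this step.
        while pos < len(sequence) and not step <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # The next step must match a strictly later transaction.
    return True


def support(database: Sequence[Sequence[Transaction]], pattern: Sequence[Transaction]) -> float:
    """Fraction of input sequences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    return sum(contains(seq, pattern) for seq in database) / len(database)


# The purchase history from the text.
history = [{"chocolate", "chips", "water"}, {"cheese"}, {"broccoli", "carrots"}]
pattern = [{"chocolate", "chips"}, {"broccoli", "carrots"}]

# Two further customers, invented purely for illustration.
database = [
    history,
    [{"chips"}, {"chocolate", "chips", "soda"}, {"water"}, {"broccoli", "carrots", "cheese"}],
    [{"broccoli", "carrots"}, {"chocolate", "chips"}],  # same items, wrong order
]

print(contains(history, pattern))   # the cheese transaction is skipped over
print(support(database, pattern))   # two of the three customers match
```

The third customer bought the same items in the opposite order and therefore does not support the pattern, which shows that order matters while adjacency does not.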
Similarly, there is strong interest in finding frequent patterns in other inherently sequential data like RFID readings in supply chain monitoring and readings from sensors monitoring natural or industrial processes. Last but not least, frequent sequence mining has also been applied to DNA data and medical treatment sequence analysis.

¹ Notice that transactions in the subsequence do not have to be adjacent to each other in the input sequence; transactions in between can be "skipped over".

Frequent sequence mining is concerned with finding sequences that are contained in a large fraction of input sequences, i.e., subsequences that have a high support. Returning to the purchase analysis example, a sequence pattern is frequent if it occurs in many customer sequences. An input sequence can support a number of subsequences that is exponential in its size. This makes frequent sequence mining for long sequences expensive.

In this paper we concentrate on a problem that is common in all the above-mentioned applications concerned with mining of event logs: bursts of common events. Consider a large digital printing machine for industrial-scale document printing. Complex systems like this continuously produce events reporting status of components (e.g., currents at various electronic components or motors, resets of system components), less severe problems (e.g., paper jams, exceptions reported by firmware, too early or too late arrival of paper at various sensors), or critical errors (transport motor faults, open interlocks during run). If the wrong paper