Mining surprising patterns using temporal description length

Soumen Chakrabarti, Sunita Sarawagi, Byron Dom
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120

Abstract

We propose a new notion of surprising temporal patterns in market basket data, and algorithms to find such patterns. This is distinct from finding frequent patterns as addressed in the common mining literature. We argue that once the analyst is already familiar with prevalent patterns in the data, the greatest incremental benefit is likely to be from changes in the relationship between item frequencies over time.

A simple measure of surprise is the extent of departure from a model, estimated using standard multivariate time series analysis. Unfortunately, such estimation involves models, smoothing windows and parameters whose optimal choices can vary dramatically from one application to another. In contrast, we propose a precise characterization of surprise based on the number of bits in which a basket sequence can be encoded under a carefully chosen coding scheme. In this scheme it is inexpensive to encode sequences of itemsets that have steady, hence likely to be well-known, correlation between items. Conversely, a sequence with large code length hints at a possibly surprising correlation.

Our notion of surprise also has the desirable property that the score of a set of items is offset by anything surprising that the user may already know from the marginal distribution of any proper subset. No parameters, such as support, confidence, or smoothing windows, need to be estimated or specified by the user.

We experimented with real-life market basket data. The algorithm successfully rejected a large number of frequent sets of items that bore obvious and steady complementary relations to each other, such as cereal and milk. Instead, our algorithm found itemsets that showed statistically strong fluctuations in correlation over time.
These items had no obviously complementary roles.

(soumen,sunita,dom)@almaden.ibm.com

1 Introduction

Data warehousing technology has enabled corporations to store huge amounts of data, and data mining has become a major motivating application. Large data sources suitable for mining are growing in number and size literally every passing moment. For almost any such data source collected over years to decades, there will be prevalent patterns or broad regularities that are already known to domain experts, and surprising patterns that are novel, unexpected and non-trivial to explain. There may be patterns of both types that are statistically significant.

There is broad consensus [15, 21, 22, 23] that the success of data mining will depend critically on the ability to go beyond obvious patterns and find novel and useful patterns [12]. Otherwise the results of mining will often be large and lack novelty, making it overwhelming and unrewarding for the analyst to sieve through them. A domain expert who is already familiar with the application domain is very unlikely to be satisfied with merely prevalent patterns, because (1) presumably the company is already exploiting them to the extent possible and (2) the competition knows about these patterns as well. The payoff from data mining lies in surprising second-order phenomena.

An ill-defined, vague notion of “domain knowledge” gets in the way of separating the novel patterns from the prevalent ones. In principle, one can propose various well-defined notions of domain knowledge. The analyst has a mental multivariate distribution over the attributes, and the system reports, from a certain class of patterns, those that reduce the distance between the mental distribution and the true distribution at the quickest rate. Of course, it is impossible to implement this in a real system.
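To make the idea of a distance between the analyst's mental distribution and the true distribution concrete, the following sketch measures that distance with the Kullback-Leibler divergence, in bits. The choice of KL divergence, the two-item setting, the probabilities, and the function names are all assumptions made for illustration; the text above does not commit to a particular distance. In this toy case the analyst is assumed to know each item's marginal frequency but to believe the two items are independent, so the remaining distance is what reporting the pairwise pattern would remove.

```python
import math
from itertools import product

def kl_bits(p, q):
    """KL divergence D(p || q) in bits for two distributions given as dicts
    over the same outcomes (outcomes with p == 0 contribute nothing)."""
    return sum(pv * math.log2(pv / q[x]) for x, pv in p.items() if pv > 0)

def independent_model(p):
    """Assumed mental model: the product of the two item marginals of p."""
    pa = sum(v for (a, _), v in p.items() if a == 1)
    pb = sum(v for (_, b), v in p.items() if b == 1)
    return {
        (a, b): (pa if a else 1 - pa) * (pb if b else 1 - pb)
        for a, b in product((0, 1), repeat=2)
    }

# Hypothetical true joint distribution over (item A present, item B present).
true_joint = {(1, 1): 0.4, (1, 0): 0.1, (0, 1): 0.1, (0, 0): 0.4}

mental = independent_model(true_joint)
gap = kl_bits(true_joint, mental)
print(f"marginals known, pair unknown: {gap:.3f} bits per basket")
print(f"pair pattern reported:         {kl_bits(true_joint, true_joint):.3f} bits")
```

With an independence-based mental model, the gap equals the mutual information between the two items, so a larger gap means the pair pattern tells the analyst more than the marginals alone. As the text notes, the analyst's actual mental distribution is not observable, so this is a thought experiment and not a practical procedure.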
1.1 Our contributions

This paper proposes and explores the notion that analysis of variation of inter-item correlations along time can approximate the role of domain knowledge in the search for interesting patterns. We concentrate on the problem of boolean market basket data [1, 2]. A set of k items is declared as “interesting” not necessarily because its absolute support exceeds a user-defined threshold, but because the relationship between the items changes over time. Furthermore, even if the support of the itemset changes over time, it is not considered interesting if the changes are totally explained by the changes in the support of smaller subsets of items.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 24th VLDB Conference, New York, USA, 1998

We develop this notion of interest based on the number of bits needed to encode an itemset sequence using a specific coding scheme that we design. In this scheme it takes relatively few bits to encode sequences of itemsets that have steady correlation between items (which are likely to be well-known). Conversely, a sequence with large code length (relative to a baseline unconstrained coding scheme) hints at a possibly surprising correlation [19]. The surprise value of the itemset is related to the difference or ratio between the constrained and unconstrained code lengths. As a subroutine in this computation, our analysis produces, in a formal