CONTOUR: An Efficient Algorithm for Discovering Discriminating Subsequences

Jianyong Wang¹, Yuzhou Zhang¹, Lizhu Zhou¹, George Karypis², and Charu C. Aggarwal³

¹ Tsinghua University, Beijing, 100084, China
² University of Minnesota, Minneapolis, MN 55455, USA
³ IBM T.J. Watson Research Center, Hawthorne, NY 10532, USA

Abstract. In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.

Keywords. Sequence mining, discriminating subsequence, summarization subsequence, clustering.

1 Introduction

Because a sequence naturally models the temporal ordering among a set of events, and abundant sequence data such as DNA strings, protein sequences, and Web logs have emerged in recent years, pattern discovery from sequence databases has attracted much attention in the data mining research community. A fundamental problem formulation is the sequential pattern mining problem [3], which finds the complete set of frequent (closed) subsequences in an input sequence database. Various efficient sequential pattern mining algorithms have been proposed in recent years [3, 19, 26, 33, 21, 4].

Several shortcomings of traditional sequential pattern mining hinder its wide application. The first is its huge result set. It is