VOGUE: A Variable Order Hidden Markov Model with Duration Based on Frequent Sequence Mining

MOHAMMED J. ZAKI, CHRISTOPHER D. CAROTHERS, and BOLESLAW K. SZYMANSKI
Rensselaer Polytechnic Institute

We present VOGUE, a novel, variable order hidden Markov model with state durations, that combines two separate techniques for modeling complex patterns in sequential data: pattern mining and data modeling. VOGUE relies on a variable gap sequence mining method to extract frequent patterns with different lengths and gaps between elements. It then uses these mined sequences to build a variable order hidden Markov model (HMM), that explicitly models the gaps. The gaps implicitly model the order of the HMM, and they explicitly model the duration of each state. We apply VOGUE to a variety of real sequence data taken from domains such as protein sequence classification, Web usage logs, intrusion detection, and spelling correction. We show that VOGUE has superior classification accuracy compared to regular HMMs, higher-order HMMs, and even special purpose HMMs like HMMER, which is a state-of-the-art method for protein classification. The VOGUE implementation and the datasets used in this article are available as open-source.¹

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data mining; I.2.6 [Artificial Intelligence]: Learning; I.5.1 [Pattern Recognition]: Models; G.3 [Probability and Statistics]: Probability and Statistics - Markov processes

General Terms: Algorithms

Additional Key Words and Phrases: Hidden Markov models, higher-order HMM, HMM with duration, sequence mining and modeling, variable-order HMM

ACM Reference Format:
Zaki, M. J., Carothers, C. D., and Szymanski, B. K. 2010. VOGUE: A variable order hidden Markov model with duration based on frequent sequence mining. ACM Trans. Knowl. Discov. Data. 4, 1, Article 5 (January 2010), 31 pages.
DOI = 10.1145/1644873.1644878 http://doi.acm.org/10.1145/1644873.1644878

¹ www.cs.rpi.edu/zaki/software/VOGUE.

This work was supported in part by NSF Grants EMT-0829835 and CNS-0103708, and NIH Grant 1R01EB0080161-01A1.

Authors’ address: Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180; email: {zaki, chrisc, szymansk}@cs.rpi.edu.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.

© 2010 ACM 1556-4681/2010/01-ART5 $10.00

ACM Transactions on Knowledge Discovery from Data, Vol. 4, No. 1, Article 5, Publication date: January 2010.