A Dynamic Bit-vector Approach for Efficiently Mining Inter-sequence Patterns

Bay Vo, Minh-Thai Tran
Information Technology College, Ho Chi Minh City, Viet Nam
{vdbay,minhthai}@itc.edu.vn

Hieu Nguyen
Yellow Pepper Viet Nam, Ho Chi Minh City, Viet Nam
minhhieu052@gmail.com

Tzung-Pei Hong
Department of CSIE, National University of Kaohsiung, Kaohsiung City, Taiwan, R.O.C
tphong@nuk.edu.tw

Bac Le
Department of Computer Science, University of Science, Ho Chi Minh, Viet Nam
lhbac@fit.hcmus.edu.vn

Abstract: The inter-sequence pattern (ISP) mining method can be used to mine both sequential patterns inside a transaction and inter-transaction patterns across several transactions. Consequently, ISP mining is more general than these two traditional mining methods. This paper proposes an algorithm that uses a dynamic bit-vector (DBV) data structure to mine ISPs efficiently. The DBV-ISP algorithm uses the divide-and-conquer method to reduce the required storage space and execution time. Experimental results show that DBV-ISP is more efficient than the EISP-Miner algorithm in terms of execution time and memory usage.

Keywords: BitTable, Dynamic bit vector, Inter-sequence pattern, Sequence pattern, Vertical data format.

I. INTRODUCTION

Sequential pattern mining from sequence databases is an important issue in data mining [4, 5, 7, 12]. Many algorithms have thus been proposed for mining sequential patterns in sequence databases [1-2, 5-6, 8-12, 15-16]. However, these algorithms treat sequences independently, without considering the relationships between sequences. Several algorithms for mining frequent sequential patterns across many transactions in sequence databases have produced remarkable results. However, these algorithms do not consider the order of items within a transaction; they treat the items as an unordered set.
Wang and Lee [14] proposed the EISP-Miner algorithm, which mines frequent inter-sequence patterns across several transactions in sequence databases. The EISP-Miner algorithm treats the items within several transactions in sequence databases as an ordered set. Hence, it is more general than existing algorithms. However, this algorithm consumes a lot of memory for storing transaction identifiers in a tree, and it requires a lot of time to find extended sequences when creating new patterns. To solve these two problems, this study proposes an algorithm that uses a dynamic bit vector (DBV) data structure to efficiently mine inter-sequence patterns. The proposed DBV-ISP algorithm uses the compressed sequence mechanism and a divide-and-conquer method to reduce the required storage space and execution time.

II. PRELIMINARY CONCEPTS

Consider a sequence database with a set of items $I = \{i_1, i_2, \ldots, i_n\}$, where $i_j$ is an item ($1 \le j \le n$). A sequence $S = \langle t_1, t_2, \ldots, t_m \rangle$ is an ordered list of itemsets, where $t_j$ is an itemset for $1 \le j \le m$. A sequence database is $D = \{s_1, s_2, \ldots, s_{|D|}\}$, where $|D|$ is the number of sequences in $D$ and $s_i$ ($1 \le i \le |D|$) is a transaction in the form <Dat, sequence>, where Dat is a domain attribute of $s_i$ that describes its context, such as time. Consider the sequence at Dat = 1 in Table 1, <C(AB)>: it indicates that a customer buys item C and then items A and B together.

Table 1: Sequence database

Let $t_1$ and $t_2$ be the Dat values of sequences $s_1$ and $s_2$, respectively. If $t_1$ is taken as the reference point, the span between $s_1$ and $s_2$ is defined as $[t_2 - t_1]$. Sequence $s_2$ at domain attribute $t_2$ with respect to $t_1$ is called an extended sequence (e-sequence) and is denoted as $s_2[t_2 - t_1]$.
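The span and e-sequence definitions above can be illustrated with a minimal Python sketch. It assumes that sequences are stored as plain strings keyed by integer Dat values, and that a megasequence (the third column of Table 1) is the concatenation of all e-sequences whose span lies between 0 and maxspan. The names DATABASE, span, e_sequence, and megasequence are hypothetical and are not part of the DBV-ISP algorithm itself.

```python
from typing import Dict

# Sequences from Table 1, keyed by Dat value.
# Storing each sequence as a string is an assumption made for illustration.
DATABASE: Dict[int, str] = {
    1: "<C(AB)>", 2: "<C(ABC)A>", 3: "<AD>", 4: "<A>",
    5: "<AC>", 6: "<BC>", 7: "<(AB)C>", 8: "<E>",
}


def span(t1: int, t2: int) -> int:
    """Span between two sequences when t1 is taken as the reference point."""
    return t2 - t1


def e_sequence(db: Dict[int, str], t1: int, t2: int) -> str:
    """Extended sequence s2[t2 - t1]: the sequence at t2, tagged with its span from t1."""
    return f"{db[t2]}[{span(t1, t2)}]"


def megasequence(db: Dict[int, str], t1: int, maxspan: int) -> str:
    """Concatenate the e-sequences whose span from t1 lies in 0..maxspan.

    This reading of "megasequence" is an assumption inferred from Table 1.
    """
    return "".join(
        e_sequence(db, t1, t2)
        for t2 in sorted(db)
        if 0 <= span(t1, t2) <= maxspan
    )
```

With the Table 1 data, megasequence(DATABASE, 2, 1) evaluates to <C(ABC)A>[0]<AD>[1], which matches the row for Dat = 2.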
Dat   Sequence     Megasequence (maxspan = 1)
1     <C(AB)>      <C(AB)>[0]<C(ABC)A>[1]
2     <C(ABC)A>    <C(ABC)A>[0]<AD>[1]
3     <AD>         <AD>[0]<A>[1]
4     <A>          <A>[0]<AC>[1]
5     <AC>         <AC>[0]<BC>[1]
6     <BC>         <BC>[0]<(AB)C>[1]
7     <(AB)C>      <(AB)C>[0]<E>[1]
8     <E>          <E>[1]

2012 Third International Conference on Innovations in Bio-Inspired Computing and Applications. 978-0-7695-4837-1/12 $26.00 © 2012 IEEE. DOI 10.1109/IBICA.2012.31

For