International Journal of Computer Applications (0975 – 8887), Volume 94 – No. 18, May 2014

Distributed Sequential Pattern Mining: A Survey and Future Scope

Suhasini Itkar
Assistant Professor, Department of Computer Engineering, Modern college of Engineering, Pune, India

Uday Kulkarni
Professor, Department of Computer Science & Engineering, SGGS Institute of Engineering and Technology, Nanded, India

ABSTRACT
Distributed sequential pattern mining is a data mining method for discovering sequential patterns from large sequence databases in a distributed environment. It is used in a wide range of applications, including web mining, customer shopping records, biomedical analysis, and scientific research. Considerable research has been done on sequential pattern mining in various distributed environments such as Grid, Hadoop, Cluster, and Cloud. The kinds of patterns that can be mined include sequential patterns, maximal sequential patterns, closed sequences, and constraint-based and time-interval-based sequential patterns. This paper presents a systematic review of work on sequential pattern mining and advanced sequential pattern mining in distributed environments, and closes with future research directions in this area.

General Terms
Association Rule Mining, Sequential Pattern Mining.

Keywords
Distributed Sequential Pattern Mining, Maximal Patterns, Constraint based Patterns, Distributed environment.

1. INTRODUCTION
Sequential pattern mining is the discovery of sequential patterns from large sequence databases. Sequential patterns are widely used in the analysis of customer purchase patterns for inventory control, web access patterns for websites, and sequences or time-related processes such as scientific experiments, natural disasters, disease treatment, and DNA sequences. The problem of finding sequential patterns was first proposed in [1].
Several approaches to mining sequential patterns have been proposed, including the Apriori-based algorithm GSP [2], SPAM [3], the projection-based FreeSpan [4], PSPM [5], the vertical data format based algorithm SPADE [6], and the pattern growth based approach UDDAG [7]. There are also specialized forms of sequential pattern mining: mining of multidimensional association rules, which involve more than one dimension [8], mining of closed patterns [9] [10], maximal patterns [11], constraint based mining [12] [13], and approximate patterns [14].

The algorithms mentioned above are mainly executed in a standalone environment, which has drawbacks such as long database scanning time, limited scalability, and low efficiency on massive datasets. To improve the performance and scalability of sequential pattern mining, many researchers have proposed techniques that work in distributed environments such as grid computing, cluster, cloud, and Hadoop, and that distribute the mining computation over more than one node.

The remainder of the paper is organized as follows. Section 2 defines the theoretical foundations and related work of distributed sequential pattern mining. Section 3 gives a taxonomy of distributed sequential pattern mining algorithms. Section 4 presents a comparative analysis of these algorithms. Section 5 concludes the study and discusses some challenging issues for future work.

2. THEORETICAL FOUNDATION AND RELATED WORK
This section presents the problem statement and attributes for sequential and distributed sequential pattern mining, and the related work on sequential pattern mining in distributed environments.

2.1 Problem Statement
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of all items. A subset of $I$ is called an itemset. A sequence $s = \langle s_1, s_2, \ldots, s_m \rangle$ is an ordered list of itemsets [3].
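As a minimal sketch of these notions, the Python snippet below represents items as strings, itemsets as frozensets, and a sequence as a list of itemsets. It also includes the usual sub-sequence containment test and support count from the sequential pattern mining literature. The names `is_subsequence`, `support`, and the toy database `toy_db` are illustrative assumptions and are not taken from any of the surveyed algorithms.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]      # a set of items occurring at the same time
Sequence = List[Itemset]      # an ordered list of itemsets


def is_subsequence(alpha: Sequence, beta: Sequence) -> bool:
    """Return True if every itemset of alpha is contained, in order,
    in a distinct itemset of beta (greedy leftmost matching)."""
    pos = 0
    for a in alpha:
        # Advance until an itemset of beta that contains a is found.
        while pos < len(beta) and not a <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return False
        pos += 1  # the next itemset of alpha must match strictly later
    return True


def support(alpha: Sequence, database: List[Sequence]) -> int:
    """Count the sequences in the database that contain alpha."""
    return sum(1 for s in database if is_subsequence(alpha, s))


# Hypothetical toy database of three sequences (illustration only).
toy_db: List[Sequence] = [
    [frozenset({"a", "b"}), frozenset({"c"}), frozenset({"d"})],
    [frozenset({"a"}), frozenset({"d"})],
    [frozenset({"c"}), frozenset({"a", "b"})],
]

pattern_ad: Sequence = [frozenset({"a"}), frozenset({"d"})]
pattern_ab_c: Sequence = [frozenset({"a", "b"}), frozenset({"c"})]

if __name__ == "__main__":
    print(support(pattern_ad, toy_db))    # 2
    print(support(pattern_ab_c, toy_db))  # 1
```

Greedy leftmost matching is sufficient here because matching an itemset as early as possible never rules out a later match. With a minimum support threshold of 2, `pattern_ad` would be frequent in this toy database and `pattern_ab_c` would not.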
Each itemset in a sequence represents a set of events happening at the same timestamp, while different itemsets occur at different times. For example, a customer shopping sequence could consist of buying several products on one trip to the store and then making several subsequent purchases, e.g., buying a PC, then antivirus and some software, followed by a digital camera, memory card and card reader, and finally a printer. Without loss of generality, we assume that the items in each itemset are sorted in a certain order (such as alphabetic order or ascending order).

Definition 1. Sequential Pattern Mining: A sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ is a sub-sequence of another sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$, denoted by $\alpha \sqsubseteq \beta$ (if $\alpha \neq \beta$, written as $\alpha \sqsubset \beta$), if and only if there exist integers $j_1, j_2, \ldots, j_n$ such that $1 \le j_1 < j_2 < \cdots < j_n \le m$ and $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$. We also call $\beta$ a supersequence of $\alpha$, and say that $\beta$ contains $\alpha$. Given a sequence database $D = \{s_1, s_2, \ldots, s_k\}$, the support of a sequence $\alpha$ is the number of sequences in $D$ that contain $\alpha$. If the support of a sequence $\alpha$ satisfies a pre-defined minimum support threshold, $\alpha$ is a frequent sequential pattern.

Definition 2. Distributed Sequential Pattern Mining: Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of all items. A sequence database is a set of tuples, each of which contains a sequence id and an element sequence. The support or frequency  of a sequence  in sequence database  means   of the