RAJASHREE SHETTAR, ISSN: 2250-3676
[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY, Volume-2, Issue-2, 204-208
IJESAT | Mar-Apr 2012, Available online @ http://www.ijesat.org

SEQUENTIAL PATTERN MINING FROM WEB LOG DATA

Rajashree Shettar 1
1 Associate Professor, Department of Computer Science, R. V College of Engineering, Karnataka, India, rajashreeshettar@rvce.edu.in

Abstract

Sequential pattern mining involves applying data mining methods to large web data repositories to extract usage patterns. With the growing popularity of the World Wide Web, many websites experience thousands of visitors every day. Analyzing who browsed what can give important insight into the buying patterns of existing customers. Correct and timely decisions based on this knowledge have helped organizations reach new heights in the market. In this paper, the sequence tree algorithm is implemented for pattern mining and evaluated on web log data. Web log data, which is regarded as secondary data of the web, is used for the discovery of frequent sequential patterns. The results show that the sequence tree algorithm performs better than the well-known Generalized Sequential Pattern (GSP) algorithm: in the experiments it runs faster than the standard GSP algorithm and also discovers more patterns.

Index Terms: Web usage mining, sequential patterns, sequence tree, web log data.

--------------------------------------------------------------------- *** ------------------------------------------------------------------------

1. INTRODUCTION

Knowledge extraction from the World Wide Web has become an important and challenging task, as an enormous amount of semi-structured data is available.
Web mining approaches such as web content mining, web structure mining and web usage mining [1, 2, 3] help to extract useful knowledge from the web. Web usage mining is used in applications such as customer shopping sequences, web clicks, biological sequences, creation of dynamic user profiles, etc. In this paper, the discovery of frequent patterns from web log data is considered. A web server usually registers a log entry for every access of a web page. First, raw web log data needs to be cleaned, condensed and transformed in order to retrieve and analyze significant and useful information. Second, pattern mining can be performed on the log records to find association patterns, sequential patterns and trends in web access. Such patterns have been used in studies on analyzing system performance, improving system design through web caching and web page pre-fetching, understanding the nature of web traffic, etc.

The process of web usage mining consists of three major steps [4]: (1) data pre-processing, (2) pattern discovery, and (3) pattern analysis. Log files are stored on the server side, on the client side and on proxy servers. In this work, only server-side web log data is used, namely the Click stream data set [5]. The data pre-processing phase includes data cleaning, user identification, session identification and data transformation, in that order. The pattern discovery phase involves the discovery of frequent sequences. The pattern analysis phase involves the analysis of the frequent patterns generated by the pattern discovery phase.

Many different techniques for mining frequent sequential patterns from log data have been proposed in the recent past. The authors in [10] report that the candidate-generation-and-test approach outperforms the pattern-growth approach when mining short patterns, while the pattern-growth approach is better when mining long patterns.
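To make the candidate-generation-and-test idea concrete, the following is a minimal Python sketch, not the GSP algorithm of the literature and not the sequence tree algorithm of this paper. It assumes that each session is simply an ordered list of page identifiers and that a pattern is supported by a session when its pages appear in the same order, possibly with gaps. The function names (`is_subsequence`, `support`, `frequent_sequences`) and the toy sessions are hypothetical and chosen only for illustration.

```python
def is_subsequence(pattern, session):
    """Return True if the pages of `pattern` occur in `session` in order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so order is respected.
    return all(page in remaining for page in pattern)


def support(pattern, sessions):
    """Count the sessions that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions)


def frequent_sequences(sessions, min_support):
    """Level-wise search: generate length-(k+1) candidates from frequent
    length-k patterns, then test each candidate against the sessions."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")

    pages = sorted({page for s in sessions for page in s})
    # Level 1: frequent single pages.
    level = [(p,) for p in pages if support((p,), sessions) >= min_support]
    frequent_pages = [pattern[0] for pattern in level]

    frequent = {}
    while level:
        for pattern in level:
            frequent[pattern] = support(pattern, sessions)
        # Generate: extend every frequent pattern by one frequent page.
        candidates = [pattern + (p,) for pattern in level for p in frequent_pages]
        # Test: keep only candidates that reach the support threshold.
        level = [c for c in candidates if support(c, sessions) >= min_support]
    return frequent


# Hypothetical toy sessions (one list of visited pages per session).
sessions = [
    ["A", "B", "C"],
    ["A", "C"],
    ["A", "B", "C", "D"],
    ["B", "C"],
]
result = frequent_sequences(sessions, min_support=3)
for pattern, count in sorted(result.items()):
    print(" -> ".join(pattern), count)
```

The sketch rescans all sessions for every candidate, which is the cost that grows quickly for long patterns; pattern-growth and tree-based methods aim to avoid this repeated scanning.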
The authors in [11] discuss the process of web log mining. Sequence mining is accomplished in [12], where a so-called WAP-tree is used for storing the patterns efficiently. Tree-like topology patterns and frequent path traversals are searched for in [13, 14, 15, 16].

2. DATA PRE-PROCESSING

The data pre-processing phase involves data cleaning, user identification, session identification and data transformation [4]. In the data cleaning stage, the web log is examined and irrelevant or redundant items, such as image, sound and video files, executable cgi files and HTML files that could be downloaded without an explicit user request, are removed. The data cleaning stage also involves removal of