Web Page Prediction by Clustering and Integrated Distance Measure

Poornalatha G¹, Prakash S Raghavendra²
Information Technology Department
National Institute of Technology Karnataka (NITK), Surathkal, Mangalore, India
poornalathag@yahoo.com¹, srp@nitk.ac.in²

Abstract—The tremendous growth of the internet and the World Wide Web in recent years has emphasized the need to reduce latency at the client, or user, end. In general, caching and prefetching techniques are used to reduce the delay the user experiences while waiting for a web page from the remote web server. This paper addresses the problem of predicting the next page to be accessed by the user by mining web server logs, which record information about the users who access the web site. The page predicted as the user's next visit can be prefetched by the browser, which in turn reduces the latency for the user. Analyzing a user's past behavior to predict the web pages the user will navigate to next is therefore of great importance. The proposed model yields good prediction accuracy compared to existing methods such as the Markov model, association rules, ANN, etc.

Keywords-sequence alignment; user session; clustering

I. INTRODUCTION

Web page prediction is the problem of forecasting the next page that a user might visit, given the currently active page or the most recently visited pages. Various applications, such as web page recommendation, web site restructuring, web caching and prefetching, advertisement placement, and search engines, would benefit from a good prediction model. Consequently, web page prediction has gained importance in the research community in recent years. This paper proposes a prediction model that could be employed for the above applications in general and for web page prefetching in particular.

There are many architectures and related algorithms for developing a web page predictor.
Most researchers emphasize the Markov model, a mathematical tool for statistical modeling. The basic concept of the Markov model is to predict the next action based on the results of previous actions. Researchers have successfully adopted this technique in the literature for training and testing on user actions and thus predicting future behavior. Deshpande et al. [1] discussed different techniques for selecting parts of different-order Markov models to obtain a model with high predictive accuracy and low state complexity. The main idea was to eliminate some of the states of the different-order Markov models based on frequency, confidence and error. For evaluation, they predicted the last page of each test session. Kim et al. [2] proposed a hybrid model using a Markov model, sequential association rules, association rules and a default model to improve performance, specifically recall; however, it did not improve prediction accuracy. Khalil et al. [3] tried to improve web page prediction accuracy by integrating clustering, association rules and Markov models. Dutta et al. [4] integrated a first-order Markov model with web page rank based on the link structure to get better prediction accuracy. Awad et al. combined the Markov model with an artificial neural network [5] and Support Vector Machines [6]. The final prediction was based on Dempster's rule. LRS (longest repeating subsequences) was used to reduce the model complexity. A frequency matrix was used to represent the first-order Markov model. Pitkow et al. [7] made an effort to reduce the model size compared to Markov models, but the static LRS model they used may not be suitable for a real-time prediction model. Jalali et al. [8], [9] proposed an algorithm based on LCS (longest common subsequence) for predicting a user's future requests. They used a graph partitioning algorithm for clustering and LCS for classification.
A new session is classified into one of the clusters, and a prediction list is generated based on the navigation patterns of the corresponding cluster. Gunduz et al. [10] proposed a prediction model that considers the order of pages and the time spent on them in a session. User sessions are clustered using a graph partitioning algorithm, and each cluster is represented by a click-stream tree. Mukhopadhyay et al. [11] proposed a clustering method to group related web pages based on access patterns. They used page ranking to build the prediction model in the initial stages. However, the average hit percentage was around 50% for a prediction window of size three and 35% for a prediction window of size two. Anitha [12] clustered the web log using the pair-wise nearest neighbor method and determined the next page accessed by sequential pattern mining. The sequence is not taken into consideration for clustering, and a Markov model is used for sequential mining. Lu et al. [13] generated significant usage patterns from abstracted web session clusters using the Needleman-Wunsch global alignment algorithm. A first-order Markov model was built for each cluster.

The accuracy of lower-order Markov models is low, while higher-order Markov models consume more space because of their many states. Therefore, to improve accuracy, researchers have tried to integrate the Markov model with other data mining techniques such as clustering and association rules.

2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 978-0-7695-4799-2/12 $26.00 © 2012 IEEE DOI 10.1109/ASONAM.2012.231
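To make the first-order Markov model from the survey above concrete, the following is a minimal sketch, not the method of any cited work. It assumes sessions are already extracted from the server log as ordered lists of page identifiers. The function names `build_first_order_model` and `predict_next`, the parameter `k`, and the toy sessions are hypothetical, chosen only for illustration. The transition counts play the role of a sparse frequency matrix.

```python
from collections import Counter, defaultdict


def build_first_order_model(sessions):
    """Count page-to-page transitions; counts[a][b] is how often b followed a."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Pair each page with its immediate successor in the session.
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return counts


def predict_next(counts, current_page, k=1):
    """Return up to k most frequent successors of current_page."""
    if current_page not in counts:
        return []  # page never seen with a successor: no prediction
    # Sort by descending count, then by page name so ties are deterministic.
    ranked = sorted(counts[current_page].items(), key=lambda kv: (-kv[1], kv[0]))
    return [page for page, _ in ranked[:k]]


# Hypothetical toy sessions (assumed data, not taken from any real log).
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]
model = build_first_order_model(sessions)

print(predict_next(model, "home"))           # most frequent successor of "home"
print(predict_next(model, "products", k=2))  # two most frequent successors
```

In this sketch the model stores one row per distinct page, which keeps it small. A higher-order variant would key the table on tuples of the last several pages, so the number of states grows quickly with the order, which is the space cost noted above.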