Web Page Prediction by Clustering and Integrated Distance Measure
Poornalatha G (1), Prakash S Raghavendra (2)
Information Technology Department
National Institute of Technology Karnataka (NITK), Surathkal,
Mangalore, India
poornalathag@yahoo.com (1), srp@nitk.ac.in (2)
Abstract—The tremendous progress of the internet and the
World Wide Web in the recent era has emphasized the
requirement for reducing the latency at the client or the user
end. In general, caching and prefetching techniques are used to
reduce the delay experienced by the user while waiting to get
the web page from the remote web server. The present paper
attempts to predict the next page to be accessed by the user
by mining the web server logs that record the activity of
users who access the web site. The page predicted to be
visited next may be prefetched by the browser, which in turn
reduces the latency for the user. Thus, analyzing the user’s
past behavior to predict the web pages to be navigated in the
future is of great importance. The proposed model yields good
prediction accuracy compared to existing methods such as the
Markov model, association rules, ANN, etc.
Keywords-sequence alignment; user session; clustering
I. INTRODUCTION
Web page prediction is the problem of forecasting the
next page that might be visited by the user from the current
active page or the most recently visited pages. Various
applications such as web page recommendation, web site
restructuring, web caching and prefetching, determining the
most appropriate place for advertisements, search engines, etc.
would benefit from a good prediction model. Consequently,
web page prediction has gained importance in recent years
among the research community. This paper proposes a
prediction model that could be employed for the above
mentioned applications in general and for web page
prefetching in particular.
There are many architectures and related algorithms for
developing a web page predictor. Most researchers
emphasize the Markov model, a mathematical tool for
statistical modeling. The basic concept of the Markov model
is to predict the next action depending on the result of
previous actions. Researchers have successfully adopted this
technique in the literature for training and testing on user
actions and thus predicting their future behavior.
Deshpande et al. [1] discussed different techniques for
selecting parts of different-order Markov models to get a
model with high predictive accuracy and low state
complexity. The main idea was to eliminate some of the
states of different-order Markov models based on frequency,
confidence and error. They predicted the last page of the test
session for evaluation purposes. Kim et al. [2] proposed a
hybrid model using a Markov model, sequential association
rules, association rules and a default model to improve
performance, specifically recall, but it did not improve
the prediction accuracy. Khalil et al. [3] tried to improve
web page prediction accuracy by integrating clustering,
association rules and Markov models. Dutta et al. [4]
integrated a first-order Markov model with web page rank
based on the link structure to get better prediction accuracy.
Awad et al. combined the Markov model with an artificial
neural network [5] and with support vector machines [6]. The
final prediction was based on Dempster’s rule. Longest
repeating subsequences (LRS) were used to reduce the model
complexity, and a frequency matrix was used to represent the
first-order Markov model.
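To make the frequency-matrix representation of a first-order Markov model concrete, the following is a minimal sketch and not the implementation used in any of the cited works. It counts page-to-page transitions in a set of sessions and predicts the most frequent successors of the current page. The function names `build_transition_counts` and `predict_next`, the `top_k` parameter, and the sample sessions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_transition_counts(sessions):
    """Count transitions: counts[a][b] = number of times page b followed page a."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        # Pair each page with the page visited immediately after it.
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return counts


def predict_next(counts, current_page, top_k=1):
    """Return up to top_k most frequent successors of current_page."""
    successors = counts.get(current_page, {})
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(successors.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:top_k]]


# Hypothetical sessions; the page names are illustrative only.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "news", "sports"],
    ["news", "sports", "home"],
]
counts = build_transition_counts(sessions)
print(predict_next(counts, "news", top_k=2))
```

In this sketch only the current page determines the prediction, which keeps the number of states small. A higher-order variant would key the counts on tuples of the last k pages, at the cost of many more states.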
J. Pitkow et al. [7] made an effort to reduce the model
size compared to Markov models. But the static LRS model
used by them may not be suitable for a real-time prediction
model. Jalali et al. [8] [9] proposed a longest common
subsequence (LCS) based algorithm for predicting the user’s
future requests. They used a graph partitioning algorithm for
clustering and LCS for classification. A new session is
classified into one of the clusters and a prediction list is
generated based on the navigation patterns of the
corresponding cluster. Gunduz et al. [10] proposed a
prediction model that considers the order of pages and the
time spent on them in a session. User sessions are clustered
using a graph partitioning algorithm and each cluster is
represented by a click stream tree. Mukhopadhyay et al. [11]
proposed a clustering method
to group related web pages based on access patterns. They
used page ranking to build the prediction model in the initial
stages. However, the average hit percentage was around 50%
for a prediction window of size three and 35% for a
prediction window of size two. Anitha [12] clustered the
web log using the pair-wise nearest neighbor method and
determined the next page accessed by sequential pattern
mining. The sequence is not taken into consideration for
clustering, and a Markov model is used for sequential mining.
Lu et al. [13] generated significant usage patterns from
abstracted web session clusters using the Needleman-Wunsch
global alignment algorithm. A first-order Markov model was
built for each cluster. Lower-order Markov models have low
accuracy, while higher-order Markov models consume
considerable space because of their many states. Therefore,
to improve accuracy, researchers have tried to integrate the
Markov model with other data mining techniques such as
clustering, association rules, etc.
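The LCS-based classification step mentioned above can be illustrated with a short sketch. It assumes, purely for illustration, that each cluster is summarized by a single representative navigation pattern and that a new session is assigned to the cluster whose pattern shares the longest common subsequence with it; the cited works may represent clusters and score matches differently. The names `lcs_length`, `classify_session` and `cluster_patterns`, as well as the sample data, are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def classify_session(session, cluster_patterns):
    """Return the cluster whose pattern has the longest LCS with the session."""
    # Iterating over sorted names makes tie-breaking deterministic.
    return max(
        sorted(cluster_patterns),
        key=lambda name: lcs_length(session, cluster_patterns[name]),
    )


# Hypothetical cluster patterns and active session, for illustration only.
cluster_patterns = {
    "news_readers": ["home", "news", "sports", "weather"],
    "shoppers": ["home", "catalog", "cart", "checkout"],
}
active_session = ["home", "catalog", "cart"]
print(classify_session(active_session, cluster_patterns))
```

Because LCS preserves page order while tolerating skipped pages, a session can match a cluster pattern even when the user did not visit every page in it. Pages that follow the matched portion of the pattern would then be natural candidates for a prediction list.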
2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
978-0-7695-4799-2/12 $26.00 © 2012 IEEE
DOI 10.1109/ASONAM.2012.231