Predicting Web Users’ Next Access Based on Log Data

Rituparna SEN and Mark H. HANSEN

This article considers models that describe how people browse the Web. We restrict our attention to navigation patterns within a single site, and base our study on standard Web server access logs. Given a visitor’s previous activities on the site, we propose models that predict their next page request. If the prediction is reasonably accurate, we might consider “prefetching” the page before the visitor requests it. A more conservative use for such predictions would be to simply update the freshness records in a proxy or network cache, eliminating unnecessary If-Modified-Since requests. Using data from the Web site for the Computing and Mathematical Sciences Research Division of Lucent Technologies (cm.bell-labs.com) we first evaluate the predictive performance of low-order Markov models. We next consider mixtures of first-order Markov models, achieving a kind of clustering of Web pages in the site. This approach is shown to perform well, while significantly reducing the space required to store the model. Finally, we explore a Bayesian approach using a Dirichlet prior on the collection of links available to a user at each stage in their travels through the site. We show that the posterior probabilities derived under this model are fairly close to the cross-validation estimates of the probability of success.

1. INTRODUCTION

The Web is a big place. The search engine www.google.com, for example, claims to have indexed more than a billion separate Web pages. Given the large number of resources on the Web, sites are constantly competing to attract and retain new visitors. To improve a user’s experience, it is now common for sites to customize content (generating pages based on a user’s previous activities) and to consider creative caching and prefetching schemes to deliver content as quickly as possible.
In both cases, a model for describing how visitors navigate a Web site can be invaluable.

Rituparna Sen is TKKK, Department of Statistics, University of Chicago, 5734 S. University Avenue, Chicago, IL 60637 (E-mail: rsen@galton.uchicago.edu). Mark H. Hansen is Member of the Technical Staff, Bell Laboratories, 600 Mountain Avenue, 2C283, Murray Hill, NJ 07974 (E-mail: cocteau@bell-labs.com).

© 2003 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
Journal of Computational and Graphical Statistics, Volume 12, Number 1, Pages 1–13
DOI: 10.1198/1061860031275
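To make the idea of a first-order Markov predictor concrete, the following is a minimal sketch rather than the models evaluated in this article. It assumes the access log has already been split into per-visitor sessions; the page paths and the function names `fit_first_order` and `predict_next` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_first_order(sessions):
    """Count page-to-page transitions over a list of sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Pair each request with the request that immediately follows it.
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return counts


def predict_next(counts, current):
    """Return (most frequent next page, its relative frequency), or None."""
    followers = counts.get(current)
    if not followers:
        return None  # page never seen as a starting point
    page, n = followers.most_common(1)[0]
    return page, n / sum(followers.values())


# Hypothetical sessions: each is the ordered list of pages one visitor requested.
sessions = [
    ["/index.html", "/who/", "/who/alice/"],
    ["/index.html", "/who/", "/who/bob/"],
    ["/index.html", "/pubs/"],
]
counts = fit_first_order(sessions)
prediction = predict_next(counts, "/index.html")
```

Here `prediction` pairs the most frequently observed follower of `/index.html` with its empirical transition probability. A cache or prefetcher might act on the prediction only when that probability exceeds some threshold, which is a design choice left open in this sketch.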