The Importance of Prior Probabilities for Entry Page Search

Wessel Kraaij
TNO TPD
PO BOX 155, 2600 AD Delft, The Netherlands
kraaijw@acm.org

Thijs Westerveld
University of Twente
PO BOX 217, 7500 AE Enschede, The Netherlands
westerve@cs.utwente.nl

Djoerd Hiemstra
University of Twente
PO BOX 217, 7500 AE Enschede, The Netherlands
hiemstra@cs.utwente.nl

ABSTRACT

An important class of searches on the world-wide-web aims to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed, a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. The URL form in particular proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Experimentation

Keywords
Entry Page Search, Prior Probabilities, Links, URLs, Language Models, Parameter Estimation

1. INTRODUCTION

Entry page searching is different from general information searching, not only because entry pages differ from other web documents, but also because the goals of the tasks are different. In a general Ad Hoc search task as defined for TREC [35, 36, 37], the goal is to find as many relevant documents as possible. The entry page (EP) task is concerned with finding the central page of an organisation, which functions as a portal for the information.¹ Since EP search has the goal to retrieve just one document, an IR system should probably be optimised for high precision rather than for high recall. Search engine users typically prefer to find an EP in the first screen of results. Since queries are usually very short, finding an EP with a high initial precision is quite difficult.

This paper explores ways to enhance IR systems designed for the Ad Hoc task by taking advantage of document features that are usually ignored for the Ad Hoc task because they were not effective there. Our experiments for the EP task at TREC-2001 have shown that link structure, URLs and anchor texts are useful sources of information for locating entry pages [38]. Experiments of other groups confirmed the effectiveness of these features [6, 14, 7, 27, 32, 39]. In this paper we show how knowledge about the relationship between the non-content features of a webpage and its likelihood of being an EP can easily be incorporated in a retrieval model based on statistical language models.

In Section 2 we discuss related work on tuning retrieval models to a specific task and other work on the use of information sources other than the document content. Section 3 discusses the basic language modelling approach and how priors can be used in this model. In Section 4 we discuss different sources of prior knowledge that can be used in an EP searching task. We describe our evaluation methodology in Section 5 and discuss our experiments and present results in Section 6. We conclude with a discussion of these results and a summary of our main conclusions.

¹ Entry pages for individual persons are usually referred to as homepages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR’02, August 11-15, 2002, Tampere, Finland.
Copyright 2002 ACM 1-58113-561-0/02/0008 ...$5.00.

2. CONTEXT AND RELATED WORK

In this section we argue that it is common practice to tailor IR methods to a specific search task, or even to a certain test collection, to optimize performance. It is important, though, to use principled methods instead of ad hoc solutions.

2.1 Tuning IR systems

The use of proper term statistics is of the utmost importance for the quality of retrieval results on the Ad Hoc search task. All main IR models based on relevance ranking exploit these statistics and are based on at least two ingredients: the frequency of a term in a document and the distribution of a term in the document collection. IR models have typically evolved in close relationship with the availability of test collections. The older test collections were based on abstracts, so it was safe to assume that documents had about the same length. Now that test collections are based on full text, this assumption no longer holds, and IR models have been refined to include a component that models the influence of document length.

Early attempts to combine the three main ingredients of an IR system were rather ad hoc and were not based on formal models. For example, Salton and Buckley [30] evaluated many combinations of term frequency statistics, document frequency statistics and document length normalisation without an explicit model for the relationship between these factors. This combine-and-test strategy was pursued further by Zobel and Moffat [40]. They tested 720 different term weighting strategies based on different similarity