A capture–recapture sampling standardization for improving Internet meta-search
Ioannis Anagnostopoulos ⁎
University of the Aegean, Department of Information and Communications Systems Engineering, Karlovassi 83200, Samos, Greece
⁎ Tel.: +30 22730 82237. E-mail address: janag@aegean.gr.
Article history:
Received 21 April 2007
Received in revised form 23 June 2009
Accepted 26 August 2009
Available online 3 September 2009
Keywords:
Internet meta-search
Web evolution
User browsing behavior
Capture–recapture standardization
Abstract

This work describes a novel sampling standardization for improving the precision of third-party results in web meta-search. The standardization uses the capture–recapture methodology, which is mainly applied in wildlife biology studies for estimating population evolution rates, together with a mechanism that records users' browsing behavior during their web search sessions. The paper provides the implementation details and an initial assessment of a third-party results ranking algorithm that employs both mechanisms. It is shown that an important quality factor in providing relevant information is how frequently Internet search services refresh their databases. It is also shown that when users' browsing behavior is examined jointly with the ability of the Internet search services to offer new and fresh results, a more effective meta-search is provided. Experimental results over a six-month period show that the precision of third-party results increased significantly at several recall levels. For acquiring third-party results, five well-known web search services were used, namely AltaVista, Google, Lycos, MSN, and Yahoo!.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Nowadays, the Internet and the World Wide Web stand among the most valuable tools in the hands of millions of users worldwide. Their use has boosted the demand for effective search over the vast amount of information that is disseminated in a chaotic way. Search services (the so-called search engines) are the information brokers between users and information. Their role is to bridge the gap and bring the information and the users' needs together. On the one hand, information on the web is mainly represented through terms and syntactic rules, while on the other hand humans have the ability to use abstract, generic and semantic concepts to conceive, describe and represent their information needs. As a result, even if two users submit the same query to a web search service, their information needs may be different [1]. Thus, constructing user profiles for specific and personalized search becomes a necessity [2,3].
In many methods presented in the literature, users (or evaluators) are asked to fill in forms describing their interests, or to label their information needs against already built categories and taxonomies [4–6]. Nevertheless, the problem is that most users are unwilling to provide explicit feedback on the returned results and their interests [7]. Solutions are provided by methods and techniques that automatically build user profiles for personalized search [8–11]. In these techniques, users interact with the web browser while their behavioral patterns are implicitly recorded and evaluated.
In parallel, the exponential growth of the web poses a serious challenge for Internet search services, due to the fact that their effectiveness relies on their information coverage. However, search services not only have to cover an increasing quantity of information, but also have to deal with evolution incidents, since new web documents and objects are relentlessly added, old ones are moved, and others frequently have their content changed or updated. All the results provided in response to a user's query are actually an "image from the past". This means that an important quality factor in providing relevant information is how frequently search services refresh their databases, since users looking for current information will find it only if the index of the search services is up-to-date. A good question that describes the problem is how far this "past image" is from the present.
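To make the capture–recapture intuition mentioned in the abstract concrete, the sketch below applies the classical Lincoln–Petersen estimator to two "captures" of the URLs returned for the same query at different dates. The URL sets, function name and turnover measure are hypothetical illustrations under assumed inputs, not the exact standardization developed later in the paper.

# A minimal sketch of capture-recapture reasoning applied to search results.
# Assumption: two snapshots ("captures") of the URLs returned for one query,
# taken at different dates. Names and data are hypothetical, for illustration only.

def lincoln_petersen(first_capture, second_capture):
    """Classical Lincoln-Petersen population estimate:
    N_hat = (n1 * n2) / m, where m is the number of items seen in both captures."""
    n1, n2 = len(first_capture), len(second_capture)
    m = len(first_capture & second_capture)  # "recaptured" URLs
    if m == 0:
        raise ValueError("No overlap between captures; the estimate is undefined.")
    return (n1 * n2) / m

# Hypothetical result snapshots for the same query, e.g. one month apart.
capture_t0 = {"http://example.org/a", "http://example.org/b", "http://example.org/c"}
capture_t1 = {"http://example.org/b", "http://example.org/c", "http://example.org/d"}

pool_estimate = lincoln_petersen(capture_t0, capture_t1)
turnover = 1 - len(capture_t0 & capture_t1) / len(capture_t0)  # fraction of old results replaced

print(f"Estimated result pool size: {pool_estimate:.1f}")
print(f"Turnover between snapshots: {turnover:.0%}")

Under this reading, a high turnover between successive captures would indicate a frequently refreshed index, which is exactly the quality factor discussed above.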
Measuring evolution rates in the cache directories of web search services derives its importance from several issues concerning both search engines and their users. On the one hand, such evolution rates highlight the ability of search engines to cope with the relentless information changes that occur on the web. Up-to-date results, coverage of new information and validity of the provided results show how well a search engine works. All these are expressed through freshness, the ability to cover new entries, the removal of obsolete content, etc. It should also be mentioned that most of the major web search services use cache evolution metrics in their ranking algorithms. For example, the freshness rate is introduced as a ranking factor in the work of Acharya et al., where several freshness factors are used as ranking features (e.g. document inception date, content updates/changes, link-based freshness criteria, and changes in anchor texts) [12]. On the other hand, it is important for a search engine to monitor such evolution rates, since these rates can be seen as a factor of information quality [13]. Older information may be helpful, but no one can deny that users mostly prefer new or up-to-date content [14]. In web-based research for