A capture–recapture sampling standardization for improving Internet meta-search
Ioannis Anagnostopoulos ⁎
University of the Aegean, Department of Information and Communications Systems Engineering, Karlovassi 83200, Samos, Greece
⁎ Tel.: +30 22730 82237. E-mail address: janag@aegean.gr.
Article history:
Received 21 April 2007
Received in revised form 23 June 2009
Accepted 26 August 2009
Available online 3 September 2009
Keywords:
Internet meta-search
Web evolution
User browsing behavior
Capture–recapture standardization
Abstract

This work describes a novel sampling standardization for improving the precision of third-party results in web meta-search. The standardization uses the capture–recapture methodology, which is mainly applied in wildlife biology studies for estimating population evolution rates, together with a mechanism that records users' browsing behavior during their web search sessions. The paper provides the implementation details and an initial assessment of a third-party results ranking algorithm that employs both mechanisms. It is shown that an important quality factor in providing relevant information is how frequently Internet search services refresh their databases. It is also shown that when users' browsing behavior is examined jointly with the ability of the Internet search services to offer new and fresh results, a more effective meta-search is provided. Experimental results over a six-month period show that the precision of third-party results increased significantly at several recall levels. For acquiring third-party results, five well-known web search services were used, namely AltaVista, Google, Lycos, MSN, and Yahoo!.
© 2009 Elsevier B.V. All rights reserved.
1. Introduction
Nowadays, the Internet and the World Wide Web stand among the most valuable tools in the hands of millions of users worldwide. Their use has boosted the demand for effective search over the vast amount of information that is disseminated in a chaotic way. Search services (the so-called search engines) are the information brokers between users and information. Their role is to bridge the gap and bring the information and the users' needs together. On the one hand, information on the web is mainly represented through terms and syntactic rules, while on the other hand humans have the ability to use abstract, generic and semantic concepts to conceive, describe and represent their information needs. As a result, even if two users submit the same query to a web search service, their information needs may be different [1]. Thus, constructing user profiles for specific and personalized search becomes a necessity [2,3].
In many methods presented in the literature, users (or evaluators) are asked to fill in forms describing their interests, or to label their information needs against already built categories and taxonomies [4–6]. Nevertheless, the problem is that most users are unwilling to provide explicit feedback on the returned results and their interests [7]. Solutions are provided by methods and techniques that automatically build user profiles for personalized search [8–11]. In these techniques, users interact with the web browser while their behavioral patterns are implicitly recorded and evaluated.
In parallel, the exponential growth of the web poses a serious challenge for Internet search services, due to the fact that their effectiveness relies on their information coverage. However, search services not only have to cover an increasing quantity of information, but also have to deal with evolution incidents, since new web documents and objects are relentlessly added, old ones are moved, and others frequently have their content changed or updated. All the results provided in response to a user's query are actually an "image from the past". This means that an important quality factor in providing relevant information is how frequently search services refresh their databases, since users looking for current information will find it only if the index of the search services is up-to-date. A good question that describes the problem is how far this "past image" is from the present.
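To make the capture–recapture intuition mentioned in the abstract concrete, the sketch below applies the classical Lincoln–Petersen estimator to two "captures" of the URLs returned for the same query at different dates. The URL sets, function name and turnover measure are hypothetical illustrations under assumed inputs, not the exact standardization developed later in the paper.

# A minimal sketch of capture-recapture reasoning applied to search results.
# Assumption: two snapshots ("captures") of the URLs returned for one query,
# taken at different dates. Names and data are hypothetical, for illustration only.

def lincoln_petersen(first_capture, second_capture):
    """Classical Lincoln-Petersen population estimate:
    N_hat = (n1 * n2) / m, where m is the number of items seen in both captures."""
    n1, n2 = len(first_capture), len(second_capture)
    m = len(first_capture & second_capture)  # "recaptured" URLs
    if m == 0:
        raise ValueError("No overlap between captures; the estimate is undefined.")
    return (n1 * n2) / m

# Hypothetical result snapshots for the same query, e.g. one month apart.
capture_t0 = {"http://example.org/a", "http://example.org/b", "http://example.org/c"}
capture_t1 = {"http://example.org/b", "http://example.org/c", "http://example.org/d"}

pool_estimate = lincoln_petersen(capture_t0, capture_t1)
turnover = 1 - len(capture_t0 & capture_t1) / len(capture_t0)  # fraction of old results replaced

print(f"Estimated result pool size: {pool_estimate:.1f}")
print(f"Turnover between snapshots: {turnover:.0%}")

Under this reading, a high turnover between successive captures would indicate a frequently refreshed index, which is exactly the quality factor discussed above.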
Measuring evolution rates in the cache directories of web search services derives its importance from several issues concerning both search engines and their users. On the one hand, such evolution rates highlight the ability of search engines to cope with the relentless information changes that occur on the web. Up-to-date results, coverage of new information and validity of the provided results show how well a search engine works. All these are expressed through freshness, the ability to cover new entries, the removal of obsolete content, etc. It should also be mentioned that most of the major web search services use cache evolution metrics in their ranking algorithms. For example, the freshness rate is introduced as a ranking factor in the work of Acharya et al., where several freshness factors are used as ranking features (e.g. document inception date, content updates/changes, link-based freshness criteria, and changes in anchor texts) [12]. On the other hand, it is important for a search engine to monitor such evolution rates, since these rates can be seen as a factor of information quality [13]. Older information may be helpful, but no one can deny that users mostly prefer new or up-to-date content [14]. In web-based research for