Search Technologies
A Topic-Specific Web Robot Model Based on Restless Bandits
Tadhg O’Meara and Ahmed Patel
University College Dublin
IEEE INTERNET COMPUTING 1089-7801/01/$10.00 ©2001 IEEE http://computer.org/internet/ MARCH • APRIL 2001 27
Web search engine design is primarily concerned with two distinct processes: ranking and indexing.¹ Ranking returns a list of the most relevant documents in response to a given query. Efficient ranking requires indexing, in which search engines construct and maintain a database, or index, of available documents.
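
The relationship between the two processes can be illustrated with a minimal sketch. It assumes a simple inverted index and a raw term-count score; the names `build_index` and `rank` and the sample documents are hypothetical, and production engines use far more elaborate ranking functions.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Indexing: map each term to a Counter of {doc_id: occurrences}."""
    index = defaultdict(Counter)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1
    return index


def rank(index, query):
    """Ranking: score documents by summed query-term counts, best first."""
    scores = Counter()
    for term in query.lower().split():
        # .get avoids creating empty entries for unknown terms.
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Sort by descending score, breaking ties by document id.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ordered]


# Hypothetical sample documents, for illustration only.
docs = {
    "a": "web robots follow hyperlinks",
    "b": "search engines rank web documents",
    "c": "restless bandits model",
}
index = build_index(docs)
results = rank(index, "web robots")
```

Because the index is built once and consulted per query, ranking touches only the documents that contain a query term rather than scanning the whole collection.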
Document acquisition can follow either a push or a pull model. In the push model, publishers submit documents to a search engine for indexing. In the pull model, search engines acquire documents themselves. Web robots (also called Web crawlers or spiders) acquire documents from Web servers by following hyperlinks. Robots require little or no cooperation from document publishers, and they give search engines control over what is indexed. Today, most robots attempt to build an index of all documents on the Web, or of a representative sample. In the future, however, we expect the use of topic-specific Web robots, which automatically build and maintain indexes of topically related Web pages, to increase significantly.
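
The pull model can be sketched as a breadth-first traversal of hyperlinks. The example below is a minimal illustration only: it assumes a hypothetical in-memory link graph (`TOY_WEB`) in place of real HTTP fetches, and the function name `crawl` and its `max_pages` limit are our own choices, not part of any robot described in this article.

```python
from collections import deque

# Hypothetical toy Web: each URL maps to the hyperlinks found on that page.
TOY_WEB = {
    "http://example.org/": ["http://example.org/a", "http://example.org/b"],
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/"],
    "http://example.org/c": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages reachable from seed in breadth-first order."""
    visited = []            # pages acquired so far, in visit order
    seen = {seed}           # URLs already queued, to avoid revisiting
    frontier = deque([seed])
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# The lambda stands in for downloading a page and extracting its links.
order = crawl("http://example.org/", lambda url: TOY_WEB.get(url, []))
```

A topic-specific robot would differ mainly in how it orders the frontier, preferring links that are likely to lead to on-topic pages instead of visiting them first-in, first-out.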
In this article, we outline the potential role of topic-specific robots in distributed search engine design, and we model the complex problem of automatically constructing and maintaining topic-specific Web indexes. Experimental results establish the viability of a topic-specific Web robot design based on the restless bandit model. The results indicate that our proposed algorithm is a good foundation on which to build a complete solution.
A Distributed Search Architecture
Search engine design that can scale with Web growth is a long-standing research goal. Today’s predominant engines (such as AltaVista, Fast, Google, and Inktomi) employ a centralized search architecture. Each provides a ranking service for all queries in the search services market. The ranking, indexing, and database components of these engines can be distributed across many computers. Efficient distribution is achieved by enabling the ranking and indexing processes to access and con-
Constructing and maintaining topic-specific Web indexes
is modeled by a restless-bandits generalization and
resolved by a reinforcement-learning algorithm.