Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys∗
Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences
École Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland
firstname.lastname@epfl.ch
Abstract
The suitability of Peer-to-Peer (P2P) approaches for full-text
web retrieval has recently been questioned because of
the claimed unacceptable bandwidth consumption induced
by retrieval from very large document collections.
In this contribution we formalize a novel indexing/retrieval
model that achieves high-performance, cost-efficient
retrieval by indexing with highly discriminative
keys (HDKs) stored in a distributed global index maintained
in a structured P2P network. HDKs correspond to
carefully selected terms and term sets appearing in a small
number of collection documents. We provide a theoretical
analysis of the scalability of our retrieval model and report
experimental results obtained with our HDK-based P2P
retrieval engine. These results show that, despite increased
indexing costs, the total traffic generated with the HDK
approach is significantly smaller than that generated by
distributed single-term indexing strategies. Furthermore,
our experiments show that the retrieval performance obtained
with a random set of real queries is comparable to
that of a centralized single-term solution using the best
state-of-the-art BM25 relevance computation scheme.
Finally, our scalability analysis demonstrates that the HDK
approach can scale to large networks of peers indexing
web-size document collections, thus opening the way towards
viable, truly decentralized web retrieval.
∗ The work presented in this paper was carried out in the framework of
the EPFL Center for Global Computing and supported by the Swiss National
Funding Agency OFES as part of the European FP 6 STREP project
ALVIS (002068).
1. Introduction
In contrast to traditional information retrieval (IR) systems,
which build upon centralized or clustered architectures,
P2P retrieval engines theoretically offer the possibility to
cope with web-scale document collections by distributing
the indexing and querying load over large networks of
collaborating peers. However, while P2P distribution results
in smaller resource consumption at the level of individual
peers, there is an ongoing debate about the overall scalability
of P2P web search because of the claimed unacceptable
bandwidth consumption induced by retrieval from very
large document collections. In [7], for example, it is shown
that a naïve use of structured or unstructured P2P networks
for web retrieval leads to practically nonviable systems, as
the traffic generated by such systems would exceed the
capacity of the existing communication networks. Additionally,
a recent study [20] has shown that, even when carefully
optimized, P2P algorithms using traditional single-term
indexes in structured P2P networks do not scale to web-size
document collections. Similarly, even for more sophisticated
schemes, such as term-to-peer indexing [5, 4] or
hierarchical federated architectures [2, 9], there is little
evidence on whether these approaches can scale to web sizes.
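To make the bandwidth concern concrete, the following minimal sketch illustrates why single-term indexes become costly as a collection grows. It is an illustration under stated assumptions, not the setup used in [7] or [20]: the toy collection, the function names (make_collection, build_single_term_index, transfer_cost, cost_for) and the cost model (number of document identifiers shipped between peers when posting lists are intersected, shortest list first) are all hypothetical.

```python
from collections import defaultdict


def make_collection(n):
    """Toy collection (hypothetical): document i contains a term
    whenever i is a multiple of that term's period."""
    periods = {"web": 2, "retrieval": 3, "peer": 5}
    return {i: [t for t, p in periods.items() if i % p == 0] for i in range(n)}


def build_single_term_index(docs):
    """Single-term inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def transfer_cost(index, query_terms):
    """Assumed cost model: each term's posting list lives on a different
    peer; the running intersection is shipped from peer to peer, starting
    with the shortest list. Returns (ids shipped, final result set)."""
    lists = sorted((index.get(t, set()) for t in query_terms), key=len)
    shipped, current = 0, lists[0]
    for nxt in lists[1:]:
        shipped += len(current)  # current list travels to the next peer
        current = current & nxt  # intersected there
    return shipped, current


def cost_for(n, query=("web", "retrieval")):
    """Ids shipped and number of hits for a collection of n documents."""
    index = build_single_term_index(make_collection(n))
    shipped, hits = transfer_cost(index, query)
    return shipped, len(hits)


if __name__ == "__main__":
    print(cost_for(30))   # (10, 5)
    print(cost_for(300))  # (100, 50)
```

In this toy model, a tenfold increase in collection size produces a tenfold increase in the identifiers shipped for the same query, because posting lists for single terms grow linearly with the collection. Real systems apply compression and other optimizations not modelled here.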
The design of scalable models for full-text IR over P2P
networks therefore remains an open issue. We argue that
any solution to this problem should satisfy at least the
following three properties: (1) it should support unrestricted
multi-term queries; (2) it should provide retrieval performance
comparable to state-of-the-art centralized search engines;
and (3) it should scale to very large networks, possibly
consisting of millions of peers. In addition, since the
natural P2P solution for processing document collections
that reach unmanageable sizes is to increase the number of
available peers, we focus on use-case scenarios in which the
maximal number of documents each peer contributes to the
global network can be assumed constant, which again makes
bandwidth consumption the major concern.
This paper formalizes our novel indexing model (originally
introduced in [13]) that maintains indexing at document
granularity and is characterized by the following central
property: we carefully select the keys used for indexing
so that they consist of terms and term sets that are
discriminative with respect to the document collection, i.e.,
they appear in a limited number of documents. Such keys, which may be
1-4244-0803-2/07/$20.00 ©2007 IEEE 1096