Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys

Ivana Podnar, Martin Rajman, Toan Luu, Fabius Klemm, Karl Aberer
School of Computer and Communication Sciences
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland
firstname.lastname@epfl.ch

Abstract

The suitability of Peer-to-Peer (P2P) approaches for full-text web retrieval has recently been questioned because of the claimed unacceptable bandwidth consumption induced by retrieval from very large document collections. In this contribution we formalize a novel indexing/retrieval model that achieves high-performance, cost-efficient retrieval by indexing with highly discriminative keys (HDKs) stored in a distributed global index maintained in a structured P2P network. HDKs correspond to carefully selected terms and term sets appearing in a small number of collection documents. We provide a theoretical analysis of the scalability of our retrieval model and report experimental results obtained with our HDK-based P2P retrieval engine. These results show that, despite increased indexing costs, the total traffic generated with the HDK approach is significantly smaller than that obtained with distributed single-term indexing strategies. Furthermore, our experiments show that the retrieval performance obtained with a random set of real queries is comparable to that of a centralized, single-term solution using the best state-of-the-art BM25 relevance computation scheme. Finally, our scalability analysis demonstrates that the HDK approach can scale to large networks of peers indexing web-size document collections, thus opening the way towards viable, truly decentralized web retrieval.

1. Introduction

Contrary to traditional information retrieval (IR) systems, which build upon centralized or clustered architectures, P2P retrieval engines theoretically offer the possibility to cope with web-scale document collections by distributing the indexing and querying load over large networks of collaborating peers. However, while P2P distribution results in smaller resource consumption at the level of individual peers, there is an ongoing debate about the overall scalability of P2P web search because of the claimed unacceptable bandwidth consumption induced by retrieval from very large document collections. In [7], for example, it is shown that a naïve use of structured or unstructured P2P networks for web retrieval leads to practically nonviable systems, as the traffic generated by such systems would exceed the capacity of the existing communication networks. Additionally, a recent study [20] has shown that, even when carefully optimized, P2P algorithms using traditional single-term indexes in structured P2P networks do not scale to web-size document collections. Similarly, even for more sophisticated schemes, such as term-to-peer indexing [5, 4] or hierarchical federated architectures [2, 9], there is little evidence on whether these approaches can scale to web sizes.

The work presented in this paper was carried out in the framework of the EPFL Center for Global Computing and supported by the Swiss National Funding Agency OFES as part of the European FP 6 STREP project ALVIS (002068).

The design of scalable models for full-text IR over P2P networks therefore remains an open issue. We argue that any solution to this problem should at least satisfy the following three properties: (1) it should support unrestricted multi-term queries; (2) it should provide retrieval performance comparable to state-of-the-art centralized search engines; and (3) it should scale to very large networks, possibly consisting of millions of peers.
In addition, as the natural P2P solution for processing document collections that reach unmanageable sizes is to increase the number of available peers, we focus on use case scenarios in which the maximal number of documents each peer contributes to the global network can be assumed constant, which again makes bandwidth consumption the major concern.

1-4244-0803-2/07/$20.00 ©2007 IEEE

This paper formalizes our novel indexing model (originally introduced in [13]) that maintains indexing at document granularity and is characterized by the following central property: we carefully select the keys used for indexing so that they consist of terms and term sets that are discriminative with respect to the document collection, i.e., appear in a limited number of documents. Such keys, which may be