Building a peer-to-peer full-text Web search engine with highly discriminative keys

Karl Aberer, Fabius Klemm, Toan Luu, Ivana Podnar, Martin Rajman
School of Computer and Communication Sciences
Ecole Polytechnique Fédérale de Lausanne (EPFL)
Lausanne, Switzerland

Abstract

Web search engines designed on top of peer-to-peer (P2P) overlay networks show promise to enable attractive search scenarios operating at a large scale. However, the design of effective indexing techniques for extremely large document collections still raises a number of open technical challenges. Resource sharing, self-organization, and low maintenance costs are favorable properties of P2P overlays from the perspective of large-scale search, but we also face new problems due to potentially huge bandwidth consumption during both indexing and querying, as well as the unavailability of global document collection statistics. Since a straightforward application of P2P solutions to Web search generates unscalable indexing and search traffic, we propose a novel indexing technique that maintains a global key index in structured P2P overlays. Keys are highly discriminative terms and term sets that appear in a restricted number of collection documents, thus limiting the size of the global index while ensuring scalable search cost. Our experimental results show reasonable indexing costs, while the retrieval quality is comparable to standard centralized solutions with TF-IDF ranking. Our indexing scheme represents a contribution toward realistic P2P Web search engines, which would open access to virtually unlimited resources, well beyond the capacity of today's best centralized Web search engines.
Keywords: peer-to-peer information systems, information retrieval, distributed indexing

1 Introduction

Web search over P2P overlay networks has recently become an intensive field of study, as this approach has the potential to become an attractive alternative to current Web search engines, for both technical and economic reasons. Contemporary Web search engines based on large computer clusters are expected to soon reach their scalability limits. Recently it has been argued that the required centralized coordination service for handling incoming queries, even if replicated, is a major system bottleneck [1]. P2P overlay networks, on the other hand, have no central coordination service and, as such, are promising candidates for next-generation search engines.

Moreover, P2P Web search is appealing from an economic perspective [2]. It allows higher diversity in content and search methods, and supports community-oriented publishing and search of Web content. P2P networks require minimal infrastructure and maintenance, and potentially offer unlimited resources, provided that enough peers join the network.

However, there is an ongoing debate on the feasibility of P2P Web search for scalability reasons. In [3] it is shown that the naïve use of unstructured or structured overlay networks is practically infeasible