Content-Based Community Formation in Hybrid Peer-to-Peer Networks

Atip Asvanund (atip@andrew.cmu.edu)
Ramayya Krishnan (rk2x@andrew.cmu.edu)
The H. John Heinz III School of Public Policy and Management
Carnegie Mellon University, Pittsburgh, PA 15146

Abstract

We present a community formation method for improving IR performance in a distributed digital library network. Following the digital library literature, we model digital libraries as leaf nodes and regional directory services as ultrapeers in hybrid P2P architectures. We conceptualize a regional directory service and its local network of digital libraries as a community, and introduce a community formation method in which digital libraries prefer to connect to the best regional directory service. Conversely, regional directory services prefer to recruit the best digital libraries to their communities, and to establish connections to the best adjacent regional directory services. Our method uses content-based techniques to identify users’ content interests and IR patterns in order to evolve the network towards a topology that improves IR efficiency. The method operates in a decentralized and transient setting without assuming a central social planner or uniquely identified content, which are typical assumptions in other work. Finally, we examine measures for reducing the overhead of applying our method and demonstrate conditions under which the benefits of applying our method outweigh its costs. This work introduces an efficient solution that is readily implementable under the practical assumptions of today’s P2P networks.

1 Introduction

Hybrid peer-to-peer (P2P) networking applications have gained much popularity in recent years with the emergence of applications such as Gnutella 0.6 and KaZaA [7, 9].
Although they serve as an important step towards improving information retrieval (IR) performance in P2P networks, their IR method is still limited to name-based techniques, and little is being done to take advantage of the sociological structures, such as content interests and IR patterns, of P2P network users. Our work introduces a practical protocol extension that improves IR performance by exploiting the ultrapeer and leaf node structure in extant hybrid architectures to promote community formation among users. The communities enable users whose content interests and IR patterns are mutually beneficial to collocate in the same or nearby communities. Our method uses content-based techniques to identify users’ content interests and IR patterns, and evolves the network towards a topology that improves IR efficiency. The method operates in a decentralized and transient setting without assuming a central social planner or uniquely identified content, which are typical assumptions in other work [4]. Finally, we examine measures for reducing the overhead of applying our method and demonstrate conditions under which the benefits of our method outweigh its costs. This work introduces an efficient solution that is readily implementable under the practical assumptions of today’s P2P networks.

Footnote 1: Financial support was provided by the National Science Foundation through grant IIS-0118767.

In response to the rampant popularity of P2P networking applications, there have recently been tremendous efforts to improve the IR performance of P2P networks. Hybrid architectures have been introduced to work around the inefficiencies associated with earlier architectural designs while retaining their core benefits. As in centralized architectures such as the original Napster, peers (a.k.a. leaf nodes) establish a connection to local central servers (a.k.a. ultrapeers or super nodes) [13].
The connection between a leaf node and an ultrapeer is similar to the connection between a peer and a central server in centralized P2P networks. However, unlike centralized P2P networks, ultrapeers are connected to each other in a structure comparable to decentralized networks (e.g., Gnutella 0.4) [6]. This design innovation increases the scalability of decentralized networks while avoiding the single point vulnerability of centralized networks. While hybrid architectures may provide improvements