Efficient Data Dissemination in Overlays *

Dung Vu, Thomas Repantis and Vana Kalogeraki
Department of Computer Science and Engineering
University of California, Riverside, Riverside, CA 92521
Email: {dungv, trep, vana}@cs.ucr.edu

Abstract

In this paper we propose adaptive data dissemination algorithms for intelligently routing search queries in a peer-to-peer network. In our mechanism, nodes build content synopses of their data and adaptively disseminate them to the most appropriate nodes. Based on the content synopses, a routing mechanism is built to forward queries to those nodes that have a high probability of providing the desired results. Our simulation results show that our approach is highly scalable and significantly improves resource usage by saving both bandwidth and processing power.

1. Introduction

The explosive growth of rich-media online content, such as audio, video, news articles, images, and documents, has created new challenges for real-time collaboration among multiple users in large-scale distributed environments. This, coupled with advances in the networking, processing, and storage capabilities of personal computers, has signaled the emergence of peer-to-peer (P2P) systems as a platform for providing and receiving data and services. In a peer-to-peer system, peers form an overlay over the physical network, employ their own location and routing mechanisms, and maintain soft-state information about other peers. The peers can be geographically distributed, heterogeneous in their resource capabilities, and dynamic in their participation in the system. Peer-to-peer systems have been used with great success for storing and sharing data [8] as well as for performing distributed computations.
Some of their attractive features include cost effectiveness (by aggregating existing resources), increased autonomy (by self-organizing), improved scalability (due to the absence of a central coordinator's bottleneck), and reliability (due to the lack of a single point of failure).

* This research was supported by NSF Award 0627191.

Several research efforts on Distributed Hash Tables [19, 21] have focused on imposing a structure on the peer-to-peer overlay by employing different algorithms to assign object keys to nodes, so as to guarantee key retrieval in logarithmic time. Even though structured overlays achieve object retrieval in bounded time, they have been inherently limited in other ways [3]: they do not support complex keyword-based queries without constraints on data placement, they do not take peer heterogeneity into account, and they do not robustly handle network dynamics, such as massive peer arrivals, departures, or failures. Several efforts have been made to address all of the above issues [3].

In this work, we focus on unstructured overlay networks, in which objects can be located at random nodes, and nodes are able to join the system at random times and leave it without a priori notification. Our motivation stems from the fact that unstructured overlays have been deployed and are being used by millions of Internet users. However, in an unstructured topology several design issues arise, one of the most challenging being the efficient search and retrieval of data or services¹. The major issue is that no central manager can have an accurate global view of the system's contents. The problem is complicated further by the fact that the environment is dynamic and heterogeneous. Peers join, leave, and fail without a priori notification and have very different and restricted processor, storage, and communication capabilities. Finally, in a large-scale peer-to-peer network, the amount of traffic generated by queries can be overwhelming.
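To make the traffic concern concrete, the following minimal sketch counts the messages generated when a query is forwarded hop-by-hop to every neighbor until a time-to-live (TTL) expires. The overlay topology, the `flood` function, the TTL values, and the `holders` set are illustrative assumptions and are not taken from the system or the simulations described in this paper.

```python
from collections import deque


def flood(overlay, source, ttl, has_object):
    """Forward a query breadth-first through the overlay.

    Returns (messages_sent, nodes_that_hold_the_object).
    Hypothetical helper for illustration only.
    """
    seen = {source}                       # nodes that already processed the query
    frontier = deque([(source, None, ttl)])
    messages = 0
    hits = set()
    while frontier:
        node, parent, remaining = frontier.popleft()
        if remaining == 0:
            continue                      # TTL exhausted: do not forward further
        for neighbor in overlay[node]:
            if neighbor == parent:
                continue                  # never send the query back to its sender
            messages += 1                 # every forwarded copy costs bandwidth
            if neighbor in seen:
                continue                  # duplicate copy is dropped by the receiver
            seen.add(neighbor)
            if has_object(neighbor):
                hits.add(neighbor)
            frontier.append((neighbor, node, remaining - 1))
    return messages, hits


# Assumed toy overlay: adjacency lists of five peers.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "E"],
    "D": ["B", "E"],
    "E": ["C", "D"],
}
holders = {"E"}  # assumed: only peer E stores the requested object

messages, hits = flood(overlay, "A", ttl=3, has_object=lambda n: n in holders)
print(messages, sorted(hits))  # prints: 8 ['E']
```

In this toy overlay, eight messages are sent to locate a single holder among five peers, and four of those messages are duplicates that the receiver simply drops. The sketch ignores message sizes, latency, and peer churn.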
Traditionally, search in unstructured peer-to-peer networks has been performed based on keyword queries, by flooding the network with messages and propagating the search query hop-by-hop until the desired answer is found. The problem with this approach is that it fails to take into account the probability that a node can provide the requested object. Hence, the search messages travel a large number of hops, wasting the processing power of many nodes and producing large amounts of network traffic, while the answer to the query is delayed.

¹ We will be using the term “object” to refer to both data and services.

Building upon this breadth-first search protocol, sev-