Data-Parallel Web Crawling Models⋆

Berkant Barla Cambazoglu, Ata Turk, and Cevdet Aykanat
Department of Computer Engineering, Bilkent University, 06800 Ankara, Turkey
{berkant,atat,aykanat}@cs.bilkent.edu.tr

Abstract. The need to quickly locate, gather, and store the vast amount of material on the Web necessitates parallel computing. In this paper, we propose two models, based on multi-constraint graph partitioning, for efficient data-parallel Web crawling. The models aim to balance the amount of data downloaded and stored by each processor as well as the number of page requests made by the processors. The models also minimize the total volume of communication during the link exchange between the processors. To evaluate the performance of the models, experimental results are presented on a sample Web repository containing around 915,000 pages.

1 Introduction

During the last decade, an exponential increase has been observed in the amount of textual material on the Web. Locating, fetching, and caching this constantly evolving content is, in general, known as the crawling problem. Currently, crawling the whole Web by means of sequential computing systems is infeasible due to the need for vast amounts of storage and high download rates. Furthermore, the recent trend toward building cost-effective PC clusters makes the Web crawling problem an appropriate target for parallel computing.

In Web crawling, starting from some seed pages, new pages are located using the hyperlinks within the already discovered pages. In parallel crawling, each processor is responsible for downloading a subset of the pages. The processors can be coordinated in three different ways: independent, master-slave, and data-parallel. In the first approach, each processor independently traverses a portion of the Web and downloads the set of pages pointed to by the links it discovers.
Since some pages are fetched multiple times, this approach suffers from an overlap problem, and hence both storage space and network bandwidth are wasted. In the second approach, each processor sends the links extracted from the pages it downloaded to a central coordinator. This coordinator then assigns the collected URLs to the crawling processors. The weakness of this approach is that the coordinating processor becomes a bottleneck.

Our focus in this work is on the third approach. In this approach, pages are partitioned among the processors such that each processor is responsible for

⋆ This work is partially supported by The Scientific and Technical Research Council of Turkey (TÜBİTAK) under project EEEAG-103E028.

C. Aykanat et al. (Eds.): ISCIS 2004, LNCS 3280, pp. 801–809, 2004.
© Springer-Verlag Berlin Heidelberg 2004
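To make the data-parallel scheme concrete, the sketch below shows how discovered links can be routed under a page-to-processor assignment. This is a simplified illustration only: it uses plain hash-based partitioning as a stand-in for the multi-constraint graph-partitioning assignment the paper proposes, and the processor count and URLs are hypothetical. The links a processor must forward to other processors correspond to the inter-processor communication volume that the proposed models aim to minimize.

```python
import hashlib

NUM_PROCESSORS = 4  # hypothetical cluster size


def assign_processor(url: str, num_procs: int = NUM_PROCESSORS) -> int:
    """Map a URL to a processor id.

    A simple hash-based assignment, used here only as a placeholder for
    the graph-partitioning-based assignment described in the paper.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_procs


def route_links(my_id: int, extracted_links: list):
    """Split links discovered by processor `my_id` into those it crawls
    itself and those it must send to their owner processors.

    The second group models the link-exchange traffic between
    processors in data-parallel crawling.
    """
    local, remote = [], {}
    for url in extracted_links:
        owner = assign_processor(url)
        if owner == my_id:
            local.append(url)
        else:
            remote.setdefault(owner, []).append(url)
    return local, remote


# Example: processor 0 routes a handful of freshly extracted links.
links = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
local, remote = route_links(0, links)
```

In a real crawler, `remote` would be flushed to the other processors in batches; balancing the sizes of each processor's `local` workload (download and storage) while shrinking the total `remote` traffic is exactly the dual objective the paper's partitioning models address.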