SharpSpider: Spidering the Web through Web Services

Ken Moody and Marco Palomino
Computer Laboratory, University of Cambridge, Cambridge, CB3 0FD
{Ken.Moody, Marco.Palomino}@cl.cam.ac.uk

Abstract

Web search engines have become an indispensable utility for Internet users. In the near future, however, Web search engines will be expected not only to provide quality search results, but also to enable applications to search and exploit their index repositories directly. We present SharpSpider, a distributed C# spider designed to address the issues of scalability, decentralisation and continuity of a Web crawl. Fundamental to the design of SharpSpider is the publication of an API for use by other services on the network. This API grants access to a constantly refreshed index built up over successive crawls of the Web.

1. Introduction

Many applications have emerged recently with an intrinsic need to search for data on the Web. Search via non-HTML interfaces, automated market research and automated comparison shopping are just a few examples. Although these applications cannot be seen as extensions of a search engine, they all call for systems that seek, download and index Web pages on a massive scale.

In the near future, Web search engines will be expected not only to provide quality search results, but also to enable remote access to their repositories. Owing to their open standards, platform independence and focus on collaboration, Web services [5] seem an ideal environment for integrating Web search engines with applications whose main input is data collected from the Web.

Here we report on SharpSpider, a distributed C# spider designed to address the issues of scalability, decentralisation and continuity of a Web crawl. Fundamental to the design of SharpSpider is the definition and publication of an API for use by other services on the network.
It is through this API that other parties interested in our software can connect remotely and issue queries against our constantly refreshed lexicons and indices.

2. Related Work

A spider is a program that automatically downloads Web pages, parses them to collect information, and uses that information to download further pages. Most recently developed spiders consist of cooperating processes that download Web pages, extract their links and, in some cases, send those links to the peer processes responsible for them [7, 8, 9, 11, 13, 15]. SharpSpider likewise comprises several communicating instances spread across different computers, whose execution we coordinate dynamically. In section 4, we elaborate on our distributed architecture.

At present, Google [2] appears to be the only search engine investigating the possibility of providing access to the indices created by a spider. In April 2002, Google released a beta version of a Web API Service [3] intended to allow developers to build applications on top of the Google search engine. Although this represents a major innovation, Google was not originally designed for such a service: its infrastructure is tuned for end users, and it imposes limitations on automated access from other services.

3. SharpSpider Features

All the data structures in SharpSpider are designed to keep operational costs as low as possible. For instance, to maintain the list of URLs already downloaded, we have implemented our own data store, based on the extendible hashing algorithm [14], as an alternative to a relational database server such as SQL Server or Oracle. This design choice gives us the opportunity to customise the algorithms for our specific needs.

SharpSpider also maintains a list of the URLs waiting to be downloaded. For this purpose, we have implemented an array of priority queues, where each entry in the array contains a queue corresponding to a specific host.
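The paper does not show the implementation of the extendible-hashing store of already-downloaded URLs; the real store is in C# and persisted on disk. As a rough, language-agnostic illustration of the technique (all names hypothetical, capacity deliberately tiny so splits are visible), an in-memory sketch of such a store might look like:

```python
import hashlib

BUCKET_CAPACITY = 4  # tiny, for illustration only; a real store would be far larger

def _hash(url):
    # Stable 64-bit hash of a URL; extendible hashing indexes by its low-order bits.
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:8], "big")

class _Bucket:
    def __init__(self, depth):
        self.depth = depth        # local depth of this bucket
        self.items = set()

class SeenUrlStore:
    """Toy in-memory extendible-hashing set of downloaded URLs (hypothetical)."""

    def __init__(self):
        self.global_depth = 1
        self.directory = [_Bucket(1), _Bucket(1)]

    def _bucket(self, url):
        # The low global_depth bits of the hash select a directory slot.
        return self.directory[_hash(url) & ((1 << self.global_depth) - 1)]

    def __contains__(self, url):
        return url in self._bucket(url).items

    def add(self, url):
        bucket = self._bucket(url)
        if url in bucket.items:
            return False          # URL already downloaded
        bucket.items.add(url)
        while len(bucket.items) > BUCKET_CAPACITY:
            self._split(bucket)
            bucket = self._bucket(url)
        return True

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            # Directory is too shallow: double it, so each slot gains a twin.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bucket.depth += 1
        new_bucket = _Bucket(bucket.depth)
        mask = 1 << (bucket.depth - 1)    # the newly significant hash bit
        # Redistribute items between the old bucket and its new twin.
        stay, move = set(), set()
        for u in bucket.items:
            (move if _hash(u) & mask else stay).add(u)
        bucket.items, new_bucket.items = stay, move
        # Repoint every directory slot whose index has the new bit set.
        for i in range(len(self.directory)):
            if self.directory[i] is bucket and i & mask:
                self.directory[i] = new_bucket
```

The point of the structure, as in the paper, is that lookups and inserts touch one bucket via a directory probe, and the directory grows only when a bucket overflows, which keeps operational costs low compared with consulting a relational database server.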
Each queue includes all the URLs of its associated host found during a crawl. If we are executing more than one instance of SharpSpider in different locations, then any URL that belongs to a

Proceedings of the First Latin American Web Congress (LA-WEB 2003) 0-7695-2058-8/03 $17.00 © 2003 IEEE
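The per-host frontier described in section 3 — an array of priority queues, one queue per host, each holding that host's URLs awaiting download — can be sketched as follows. This is a minimal illustration, not SharpSpider's C# implementation; the class name, method names and the lowest-value-first priority convention are all assumptions:

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

class Frontier:
    """Hypothetical sketch of a per-host frontier: one priority queue
    per host, holding the URLs of that host waiting to be downloaded."""

    def __init__(self):
        # host -> binary heap of (priority, url); lower value = higher priority
        self.queues = defaultdict(list)

    def enqueue(self, url, priority=0):
        host = urlparse(url).netloc
        heapq.heappush(self.queues[host], (priority, url))

    def dequeue(self, host):
        # Pop the highest-priority URL queued for this host, if any.
        if self.queues.get(host):
            return heapq.heappop(self.queues[host])[1]
        return None
```

Grouping the pending URLs by host in this way lets a crawler pick which host to visit next independently of URL ordering, which is also what makes it natural to hand entire per-host queues to different crawler instances in a distributed deployment.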