Architectural Design of WebScales - A Large-Scale Metasearch Engine

Weiyi Meng
SUNY at Binghamton
Department of Computer Science
Binghamton, NY 13902
1-607-777-4311
meng@cs.binghamton.edu

Clement Yu
University of Illinois at Chicago
Department of Computer Science
Chicago, IL 60607
1-312-996-2318
yu@cs.uic.edu

Zonghuan Wu, Vijay Raghavan
University of Louisiana at Lafayette
Center for Advanced Computer Studies
Lafayette, LA 70504
1-337-482-5243
{zwu,raghavan}@cacs.louisiana.edu

ABSTRACT
It is estimated that there are hundreds of thousands of information sources on the Web, including both the Surface Web and the Deep Web. Most of these sources have their own search capabilities. To relieve ordinary users of the formidable task of identifying useful sources and searching them individually, it is important to provide unified access to these sources. A metasearch engine is a system that provides unified access to multiple existing search systems. In this paper, we provide an architectural design of a large-scale metasearch system, WebScales. We also discuss what we have already done in developing an operational system based on the proposed architecture.

Categories and Subject Descriptors
C.2.4 [Computer-Communication Networks]: Distributed Systems - distributed databases.
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - search process; selection process.
H.3.4 [Information Storage and Retrieval]: Systems and Software - information networks.

General Terms
Algorithms, Design

Keywords
Metasearch, search engine, distributed digital library, WebScales

1. INTRODUCTION
In recent years, the World Wide Web has come to be regarded as the largest digital library. People all over the world use the Web to find all sorts of information, from medical information to food recipes. The Web is inherently distributed, and the data on the Web is stored in numerous sources. Often, these sources have their own search capabilities.
For example, many organizations, companies and universities have their own Web sites, and most of these sites have their own search engines. The information a particular user needs is frequently stored in multiple sources. For example, publications on a subject may be available in the sources of many publishers, universities and research labs. It is difficult for an ordinary user to identify all the relevant sources and search each one separately to find all the information he or she needs.

One way to address this problem is to build a metasearch engine on top of multiple sources in advance. A metasearch engine is a system that supports unified access to multiple existing search engines. Upon receiving a query, it invokes some of the underlying search engines to retrieve suitable documents. It then re-ranks the documents from the different sources and presents the retrieved documents to the user. Compared with any single underlying search engine, the metasearch engine has higher coverage, since its document database (i.e., the set of documents searchable by the search engine) is the union of the document databases of all the individual search engines. The advantages of metasearch engines have been recognized by many researchers and practitioners. As a result, many metasearch engines have been created (see, e.g., SavvySearch (www.search.com), ProFusion (www.profusion.com), Dogpile (www.dogpile.com)) and a large amount of research work has been published in the literature (see Section 3).

In this paper, we describe the architectural design of WebScales, a large-scale metasearch engine under development. The main difference between WebScales and other metasearch engines is that WebScales aims to connect to all useful search engines. It is estimated that the number of search engines on the Web is in the hundreds of thousands [1]. Therefore, WebScales must emphasize scalability and automation. Existing metasearch engines, in contrast, are small in scale.
For example, ProFusion, currently the largest metasearch engine, connects to about 1000 search engines.

The rest of the paper is organized as follows. In Section 2, we present the architectural design of WebScales and discuss its components and their relationships. In Section 3, we compare related work with respect to the different components. In Section 4, we present our design and implementation of several components of WebScales. Section 5 concludes the paper.
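
The metasearch workflow described in this introduction (invoke some of the underlying search engines, then re-rank the documents they return and present a single list) can be illustrated with a minimal Python sketch. This is not the WebScales implementation: the names `Result`, `metasearch`, `select`, and the toy engines are hypothetical, and the merging rule (keep each document's highest local score, assuming scores from different engines are directly comparable) is a simplifying assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Result:
    url: str
    score: float  # local score from one engine; assumed comparable across engines


# Assumption: a search engine is modelled as a function from a query to results.
Engine = Callable[[str], List[Result]]


def metasearch(
    query: str,
    engines: Dict[str, Engine],
    select: Callable[[str, str], bool],
    top_k: int = 10,
) -> List[Result]:
    """Invoke the selected engines, then merge and re-rank their results."""
    best: Dict[str, float] = {}
    for name, search in engines.items():
        # Only some of the underlying engines are invoked for a given query.
        if not select(name, query):
            continue
        for result in search(query):
            # A document returned by several engines keeps its highest score.
            if result.score > best.get(result.url, float("-inf")):
                best[result.url] = result.score
    merged = [Result(url, score) for url, score in best.items()]
    # Re-rank: highest score first, ties broken by URL for determinism.
    merged.sort(key=lambda r: (-r.score, r.url))
    return merged[:top_k]


# Toy engines with hard-coded results (hypothetical data for illustration only).
def recipes_engine(query: str) -> List[Result]:
    return [Result("recipes.example/soup", 0.9), Result("shared.example/page", 0.4)]


def medical_engine(query: str) -> List[Result]:
    return [Result("health.example/flu", 0.8), Result("shared.example/page", 0.7)]


ENGINES: Dict[str, Engine] = {"recipes": recipes_engine, "medical": medical_engine}
```

Calling `metasearch("soup", ENGINES, select=lambda name, query: True)` queries both toy engines and returns one merged list in which the document indexed by both engines appears only once. The merged list covers the union of what the individual engines return, which mirrors the coverage argument made above.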