Load Balancing Distributed Inverted Files

Mauricio Marin                     Carlos Gomez
Yahoo! Research Santiago, Chile
mmarin@yahoo-inc.com               cgomez@dcc.uchile.cl

ABSTRACT

This paper presents a comparison of scheduling algorithms applied in the context of load balancing the query traffic on distributed inverted files. We implemented a number of algorithms taken from the literature. We propose a novel method to formulate the cost of query processing so that these algorithms can be used to schedule queries onto processors. We avoid measuring load balance at the search engine side because this can lead to imprecise evaluation. Instead, our method is based on the simulation of a bulk-synchronous parallel computer at the broker machine side. This simulation determines an optimal way of processing the queries and provides a stable baseline upon which both the broker and the search engine can tune their operation in accordance with the observed query traffic. We conclude that the simplest load balancing heuristics are good enough to achieve efficient performance. Our method can be used in practice by broker machines to schedule queries efficiently onto the cluster processors of search engines.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms

Algorithms, Performance

Keywords

Inverted Files, Parallel and Distributed Computing

1. INTRODUCTION

Cluster-based search engines use distributed inverted files [10] to deal efficiently with a high traffic of user queries. An inverted file is composed of a vocabulary table and a set of posting lists. The vocabulary table contains the set of relevant terms found in the text collection.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. WIDM'07, November 9, 2007, Lisboa, Portugal. Copyright 2007 ACM 978-1-59593-829-9/07/0011 ...$5.00.

Each of these terms is associated with a posting list that contains the identifiers of the documents in which the term appears, along with additional data used for ranking purposes. To solve a query, it is necessary to retrieve the set of documents associated with the query terms and then rank these documents in order to select the top-K documents as the query answer.

The approach taken by well-known Web search engines to the parallelization of inverted files is pragmatic: they use the document partitioned approach. Documents are evenly distributed over P processors and an independent inverted file is constructed for each of the P sets of documents. The disadvantage is that each user query has to be sent to all P processors, and imbalance can arise at the posting-list level (which increases disk access and interprocessor communication costs). The advantage is that document partitioned indexes are easy to maintain, since the insertion of new documents can be done locally, and this locality is extremely convenient for the posting-list intersection operations required to solve queries (they come for free in terms of communication costs). Intersection of posting lists is necessary to determine the set of documents that contain all of the terms present in a given user query.
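To make the document partitioned scheme concrete, the following is a minimal sketch (not the paper's implementation): documents are assigned to P partitions round-robin by identifier, each partition builds its own local inverted file, and a conjunctive query is broadcast to all partitions, which intersect their local posting lists before the broker merges the partial answers. Tokenization by whitespace and the dictionary-of-sets representation are simplifying assumptions for illustration only.

```python
from collections import defaultdict

def build_document_partitioned_index(documents, P):
    """Distribute documents evenly over P processors (round-robin by
    doc id) and build an independent local inverted file on each one."""
    partitions = [defaultdict(set) for _ in range(P)]
    for doc_id, text in documents.items():
        local = partitions[doc_id % P]        # even document distribution
        for term in text.split():             # toy whitespace tokenizer
            local[term].add(doc_id)
    return partitions

def query_document_partitioned(partitions, terms):
    """A conjunctive (AND) query is broadcast to all P partitions; each
    one intersects its local posting lists (no communication needed for
    the intersection itself), and the broker unions the partial results."""
    answer = set()
    for local in partitions:
        lists = [local.get(t, set()) for t in terms]
        if lists:
            answer |= set.intersection(*lists)
    return answer
```

For example, with documents {0: "cat dog", 1: "cat fish", 2: "dog fish", 3: "cat dog fish"} and P = 2, the query ["cat", "dog"] yields {0, 3}: each partition intersects locally and only the small partial answers travel to the broker, which is exactly the communication advantage described above.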
A competing approach is the term partitioned index, in which a single inverted file is constructed from the whole text collection and the terms, with their respective posting lists, are then distributed evenly onto the processors. However, the term partitioned inverted file destroys the possibility of computing intersections for free in terms of communication cost, so one is compelled to use strategies such as smart distribution of terms onto processors to increase locality for the most frequent terms (which can be detrimental to overall load balance) and caching. On the other hand, it is not necessary to broadcast queries to all processors (which reduces communication costs), and disk latency costs are smaller, as they are paid once per posting-list retrieval per query; it is well known that with current cluster technology it is faster to transfer blocks of bytes through the interprocessor network than from RAM to disk. Nevertheless, load balance is sensitive to queries referring to particular terms with high frequency, making it necessary to use posting-list caching strategies to overcome imbalance in disk accesses.

Both strategies can be efficient depending on the method used to perform the final ranking of documents. In particular, the term partitioned index is better suited for methods that do not require performing posting-list intersections. We have observed that the balance of disk accesses, that is posting-list fetching, and document ranking are the most relevant