Extending Peer-to-Peer Networks for Approximate Search

Alain Mowat and Roman Schmidt
Ecole Polytechnique Fédérale de Lausanne (EPFL)
1015 Lausanne, Switzerland

Michael Schumacher
University of Applied Sciences Western Switzerland
3960 Sierre, Switzerland

Ion Constantinescu
Digital Optim, USA

ABSTRACT

This paper proposes a way to enable approximate queries in a peer-to-peer network by using a special encoding function and error correcting codes. The encoding function maintains neighborhood relationships, so that two similar inputs result in two similar outputs. The error correcting code is then used to group the similar encoded values around special codewords. In this manner, similar content is located as close together as possible in the network. The algorithm is tested in a simulated environment on a HyperCube network overlay.

1. INTRODUCTION

When searching for information on the Internet, it often happens that the searcher does not know the correct spelling of an author's name or simply mistypes a word in the search query. In most peer-to-peer (P2P) systems, this leads to erroneous results or to no results at all. Approximate queries extend the notion of normal queries by allowing entries that only partially match the initial query to be retrieved. We call the c-neighborhood of u the set of all items in $\Sigma^n$ that differ from u by at most c bits. The idea is that when searching for u in the network, we extend the search to its entire c-neighborhood.

P2P networks can be classified into three categories depending on how they index and search items in their network. Systems with local or central indices can implement approximate search easily. However, their search function is not as efficient as that of distributed hash tables (DHTs). In DHTs, each peer is responsible for indexing a certain range of files. Every file and every peer is assigned an identifier (ID) that represents it in the network.
The peers then index the files whose IDs are closely related to their own. The problem arises when assigning IDs to the files: most networks use a hashing mechanism, which destroys any information about the files and thus makes approximate search impossible.

In this paper we propose a new hashing (or encoding) function that preserves locality while trying to maintain an even distribution of hash values.¹ The particularity of our hashing function is that it does not map a query to a single bucket; instead, it returns a range of buckets that contain the query and its c-neighborhood. Thus the hashing function will be: $h_{AQ} : \Sigma^n \to \{B_1, \ldots, B_p\} \times \cdots \times \{B_1, \ldots, B_p\}$. In addition to this encoding function, the perfect Golay error correcting code is used to group similar content and queries in the network, thus allowing a certain degree of approximation in the search algorithm.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC'08 March 16-20, 2008, Fortaleza, Ceará, Brazil
Copyright 2008 ACM 978-1-59593-753-7/08/0003 ...$5.00.

2. ERROR CORRECTING CODE

The error correcting code (ECC) used in our approach is the Golay code. It is a $(23, 2^{12}, 7)$ code, meaning that a received vector is 23 bits long, that there are $2^{12} = 4096$ codewords, and that any two codewords differ by at least 7 bits. A received vector (RV) is a vector or string of bits that is received by the destination during a transmission over a noisy channel. The Golay code is additionally a perfect code, in the sense that every possible RV lies within 3 bits of one and only one codeword and is mapped to it. In this way, the code can correct up to 3 erroneous bits.
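The parameters above can be checked with a short sketch. It is a minimal illustration, not the implementation used in this paper: it assumes the standard cyclic construction of the Golay code with generator polynomial x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1 and a simple syndrome-table decoder, and the names `poly_mod`, `encode`, `decode` and `SYNDROME_TABLE` are hypothetical helpers introduced only for this example.

```python
from itertools import combinations
from math import comb

N, K, T = 23, 12, 3        # code length, message bits, correctable errors
CHECK_BITS = N - K         # 11 parity bits

# Assumption: cyclic construction with generator polynomial
# x^11 + x^10 + x^6 + x^5 + x^4 + x^2 + 1 (one standard choice).
GEN_POLY = 0b110001110101


def poly_mod(value: int) -> int:
    """Remainder of `value`, read as a polynomial over GF(2), divided by GEN_POLY."""
    for shift in range(value.bit_length() - CHECK_BITS - 1, -1, -1):
        if (value >> (shift + CHECK_BITS)) & 1:
            value ^= GEN_POLY << shift
    return value


def encode(message: int) -> int:
    """Systematic encoding: 12 message bits followed by 11 parity bits."""
    shifted = message << CHECK_BITS
    return shifted | poly_mod(shifted)


def build_syndrome_table() -> dict:
    """Map the syndrome of every error pattern of weight <= 3 to that pattern."""
    table = {}
    for weight in range(T + 1):
        for positions in combinations(range(N), weight):
            error = sum(1 << p for p in positions)
            table[poly_mod(error)] = error
    return table


SYNDROME_TABLE = build_syndrome_table()


def decode(received: int) -> int:
    """Return the unique codeword within 3 bits of a 23-bit received vector."""
    return received ^ SYNDROME_TABLE[poly_mod(received)]


# Number of 23-bit vectors within 3 bits of one codeword (size of one ball).
ball_size = sum(comb(N, i) for i in range(T + 1))

# Two received vectors that fall inside the ball of the same codeword.
codeword = encode(0b101100111000)
rv_a = codeword ^ (1 << 0) ^ (1 << 7)                # two flipped bits
rv_b = codeword ^ (1 << 3) ^ (1 << 12) ^ (1 << 20)   # three flipped bits
same_location = decode(rv_a) == decode(rv_b) == codeword
```

Each ball contains 1 + 23 + 253 + 1771 = 2048 vectors, and 4096 such balls account for all 2^23 possible received vectors, which is what makes the code perfect. In the sketch, `rv_a` and `rv_b` both decode to `codeword`, so they would be indexed at the same place.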
We can picture the code as a set of points in a hyperspace, each codeword being surrounded by a ball of radius 3. These balls are disjoint and together cover the whole hyperspace. A received vector can thus be positioned anywhere and will always be contained in exactly one of the balls (Fig. 2).

In this paper, the ECC is applied after the encoding of the query. The encoding creates a 23-bit vector and the Golay code then computes the corresponding codeword. As we will see, the encoding function is neighborhood-sensitive: similar queries are mapped to similar received vectors. This is where the ECC comes into play. As long as two RVs each lie within 3 bits of the same codeword, the Golay code maps them both to that codeword. This way they will be indexed at the same location in the network (see Fig. 1).

Figure 1: Sequence of modules transforming two similar queries into a single codeword

¹ The work presented in this paper was partly supported by the Swiss National Research Foundation grant Nr. PI0I2–115015 / 1 and by the Swiss National Funding Agency OFES as part of the European project NEPOMUK No FP6-027705.

Alone, this system is not powerful enough to correct queries that contain many errors. Only 3 bits can be corrected, so