LEEWAVE: Level-Wise Distribution of Wavelet Coefficients for Processing kNN Queries over Distributed Streams

Mi-Yen Yeh †,‡   Kun-Lung Wu ‡   Philip S. Yu §   Ming-Syan Chen †

† Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
‡ IBM T.J. Watson Research Center, Hawthorne, NY
§ Department of Computer Science, University of Illinois at Chicago, Chicago, IL

miyen@arbor.ee.ntu.edu.tw, klwu@us.ibm.com, psyu@cs.uic.edu, mschen@cc.ee.ntu.edu.tw

ABSTRACT

We present LEEWAVE, a bandwidth-efficient approach to searching range-specified k-nearest neighbors among distributed streams by LEvEl-wise distribution of WAVElet coefficients. To find the k streams most similar to a range-specified reference stream, the relevant wavelet coefficients of the reference stream can be sent to the peer sites to compute the similarities. However, bandwidth is wasted unnecessarily if all the relevant coefficients are sent at once. Instead, we present a level-wise approach that leverages the multi-resolution property of the wavelet coefficients. Starting from the top and moving down one level at a time, the query initiator sends only the coefficients of a single level to a progressively shrinking set of candidates. However, there is one difficult challenge in LEEWAVE: how does the query initiator prune the candidates without knowing all the relevant coefficients? To overcome this challenge, we derive and maintain a similarity range for each candidate and gradually tighten the bounds of this range as we move from one level to the next. The increasingly tightened similarity ranges enable the query initiator to prune the candidates effectively without causing any false dismissal. Extensive experiments with real and synthetic data show that, when compared with prior approaches, LEEWAVE uses significantly less bandwidth under a wide range of conditions.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires a fee and/or special permission from the publisher, ACM.
VLDB '08, August 24-30, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 000-0-00000-000-0/00/00.

1. INTRODUCTION

Processing data streams has become increasingly important as more and more emerging applications are required to handle large amounts of data in the form of rapidly arriving streams. Examples include data analysis in sensor networks, program trading in financial markets, video surveillance and weather forecasting. In response, many organizations [1, 3, 5, 9, 30, 32] have started developing data stream processing systems (DSPS).

[Figure 1: System model. An initiator site P_init holds the reference stream S_ref with a user-defined query range; its query processor collects statistics from the peer sites P_1, P_2, ..., P_M, requests further statistics when required, and returns the answers.]

Finding k-nearest neighbors (kNN) is one of the most common applications in computing. Processing kNN queries has been one of the most studied problems in traditional non-streaming database research, and the same is believed to hold in data stream processing. For a kNN query, the DSPS finds the k streams whose patterns are most similar to a given pattern contained in a reference stream. Compared to kNN query processing in traditional databases, stream-based kNN query processing is much more challenging, because it must handle an endlessly growing amount of data with limited resources. Nevertheless, many researchers have started working on various aspects of stream-based kNN query processing [13, 17, 19, 21].
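As an illustration only, the level-wise idea summarized in the abstract can be sketched in Python. The sketch rests on several assumptions that are not taken from this paper: squared Euclidean distance as the dissimilarity measure, an orthonormal Haar transform, equal-length streams whose length is a power of two, and a simple triangle-inequality bound on the coefficients not yet seen. The function names (haar_levels, energy, knn_levelwise) are hypothetical, all sites are simulated in one process, and the bounds are not the similarity ranges derived in this work.

```python
import math


def haar_levels(x):
    """Orthonormal Haar transform of x (length must be a power of two).

    Returns coefficient lists ordered from the coarsest level
    (the single scaled average) down to the finest detail level.
    """
    if not x or len(x) & (len(x) - 1):
        raise ValueError("length must be a power of two")
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return [approx] + details[::-1]


def energy(levels):
    """Sum of squared coefficients over the given levels."""
    return sum(c * c for level in levels for c in level)


def knn_levelwise(ref, candidates, k):
    """Return (names of the k nearest candidates, coefficients sent).

    Assumption: every candidate stream has the same length as ref.
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    ref_lv = haar_levels(ref)
    cand_lv = {name: haar_levels(s) for name, s in candidates.items()}
    partial = dict.fromkeys(candidates, 0.0)  # distance over levels seen so far
    alive = set(candidates)
    sent = 0
    for depth, ref_coefs in enumerate(ref_lv):
        # Norm of the reference coefficients on the levels not yet sent.
        ref_rest = math.sqrt(energy(ref_lv[depth + 1:]))
        bounds = {}
        for name in alive:
            sent += len(ref_coefs)  # one level goes to one surviving candidate
            own = cand_lv[name]
            partial[name] += sum(
                (a - b) ** 2 for a, b in zip(ref_coefs, own[depth])
            )
            rest = math.sqrt(energy(own[depth + 1:]))
            # Triangle inequality on the unseen part gives a distance range.
            bounds[name] = (
                partial[name] + (ref_rest - rest) ** 2,
                partial[name] + (ref_rest + rest) ** 2,
            )
        # Keep a candidate only if its lower bound does not exceed the
        # k-th smallest upper bound.
        kth_upper = sorted(hi for _, hi in bounds.values())[k - 1]
        alive = {n for n in alive if bounds[n][0] <= kth_upper}
    return sorted(alive, key=lambda n: partial[n])[:k], sent
```

Because the transform is orthonormal, the squared distance summed over all levels equals the squared distance between the raw streams, so each range tightens to the exact value at the last level. A candidate is dropped only when its lower bound exceeds the k-th smallest upper bound, which in exact arithmetic cannot remove a true top-k stream. The returned count of coefficients sent can be compared with the cost of sending every coefficient to every candidate.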
However, existing work on stream-based kNN query processing mainly focuses on the case where data streams are collected and processed at a central site.

In many real-world applications, however, data streams are usually collected in a decentralized manner. For example, to forecast the weather and track global climate changes, meteorologists collect streams of measurements, such as temperatures, from observation stations located over a wide area. In surveillance, video cameras are set up in many places and continuously capture images from various angles. Similarly, readings from a sensor network are collected in a distributed fashion. In these cases, it is inefficient to gather all of the distributed streams at a central site before doing any query processing. It is even impossible to do so when the available network bandwidth is limited. Hence, there is a need to develop a bandwidth-efficient approach to processing kNN queries among distributed streams.

Permission to make digital or hard copies of portions of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright for components of this work owned by others than VLDB Endowment must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers or to redistribute to lists requires prior specific permission and/or a fee. Request permission to republish from: Publications Dept., ACM, Inc. Fax +1 (212) 869-0481 or permissions@acm.org.
PVLDB '08, August 23-28, 2008, Auckland, New Zealand
Copyright 2008 VLDB Endowment, ACM 978-1-60558-305-1/08/08

In this paper, we study the problem of processing distributed kNN (k-similarity) queries. The system model is shown in Fig. 1, where there are M distributed sites, each monitoring one or more streams. These sites can communicate with each other via a communication network. Given a reference stream S_ref maintained by an initiator site, P_init, the goal is to find the k streams among all M sites with the highest similarities to S_ref than other streams in