Exploiting Spatial Correlation Towards an Energy Efficient Clustered AGgregation Technique (CAG)

SunHee Yoon and Cyrus Shahabi
Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
{sunheeyo, shahabi}@usc.edu

Abstract— In Wireless Sensor Networks (WSN), monitoring applications use in-network aggregation to minimize energy overhead by reducing the number of transmissions between the nodes. We note that nearby sensor nodes monitoring an environmental feature (e.g., temperature or brightness) typically register similar values. In this paper, we propose Clustered AGgregation (CAG), a mechanism that reduces the number of transmissions and provides approximate results to aggregate queries by utilizing the spatial correlation of sensor data. The result is guaranteed to be within a user-provided error-tolerance threshold. While a query is disseminated to the network, CAG forms clusters of nodes sensing similar values. Subsequently, only one value per cluster is transmitted up the aggregation tree. We use mathematical models and simulations with synthetic and empirical data to evaluate the efficiency-correctness tradeoff of CAG. Our simulation shows that with highly correlated sensor readings and a 10% error threshold, CAG can reduce communication overhead by as much as 70.9% compared with TAG while incurring a modest 1.7% error in the result.

I. INTRODUCTION

In WSN, in-network query processing is a common way to minimize communication by increasing path sharing, as in Directed Diffusion [6], TinyDB [12], and Cougar [20]. TinyDB, the landmark in-network query processing system for WSN, has a fixed set of query operators supported by a query processor. Alternatively, Directed Diffusion allows users to define their own in-network aggregation operators. Tree-based routing is used in Tiny AGgregation (TAG) [12], while data-centric routing is used in Directed Diffusion [6].
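
To make the clustering idea from the abstract concrete, the following Python sketch forms clusters while a query travels down a small aggregation tree and then reports one value per cluster. It is only an illustration under stated assumptions, not the protocol as specified in this paper: the `Node` class, the `form_clusters` function, the example readings, and the rule that a node joins its parent's cluster when its reading lies within a fraction `tau` of the cluster head's reading are all hypothetical, and the average is taken over cluster-head values without weighting.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical sensor node in an aggregation tree."""
    name: str
    reading: float
    children: List["Node"] = field(default_factory=list)


def form_clusters(root: Node, tau: float) -> List[Tuple[Node, List[str]]]:
    """Walk the tree top-down, as a query would, and group similar nodes.

    Assumed rule (not taken from the paper): a node joins its parent's
    cluster when its reading is within tau * |head reading| of that
    cluster's head; otherwise it becomes the head of a new cluster.
    """
    clusters: List[Tuple[Node, List[str]]] = []

    def visit(node: Node, current: Optional[Tuple[Node, List[str]]]) -> None:
        if current is not None:
            head, members = current
            if abs(node.reading - head.reading) <= tau * abs(head.reading):
                members.append(node.name)   # similar enough: stay silent later
            else:
                current = None              # too different: start a new cluster
        if current is None:
            current = (node, [node.name])
            clusters.append(current)
        for child in node.children:
            visit(child, current)

    visit(root, None)
    return clusters


# Toy tree with made-up temperature readings.
tree = Node("A", 20.0, [
    Node("B", 20.5, [Node("D", 21.5), Node("E", 25.0)]),
    Node("C", 30.0, [Node("F", 31.0), Node("G", 29.0)]),
])

clusters = form_clusters(tree, tau=0.10)

# Only cluster heads report a value up the tree.
head_values = [head.reading for head, _ in clusters]
approx_avg = sum(head_values) / len(head_values)

# Exact average over every node, for comparison.
all_readings = [20.0, 20.5, 21.5, 25.0, 30.0, 31.0, 29.0]
exact_avg = sum(all_readings) / len(all_readings)
```

In this toy tree, three values travel up instead of seven, and the average of the head values (25.0) differs from the exact average (about 25.29) by roughly 1.1%. The numbers depend entirely on the invented readings and say nothing about the simulation results reported in the paper.
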
Structural [9] and habitat [13] monitoring, the most popular applications of WSN to date, can be implemented efficiently using these in-network aggregation systems. They enable a user to issue a query that is flooded to the network to build data forwarding and aggregation plans. Such flooding-based systems can be made more energy efficient by exploiting the spatial correlation in sensor data.

Allowing for an approximate result, rather than requiring an exact answer, enables the design of energy-efficient mechanisms to compute in-network aggregates. Approximate results can be used in an interactive setting in which users may first ask for a rough picture of regional data before they decide to drill down further [3]. In this scenario, not every sensed value is required to compute the synopsis.

This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC), IIS-0238560 (PECASE), IIS-0324955 (ITR) and IIS-0307908, and unrestricted cash gifts from Microsoft. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Both energy efficiency and accuracy are important in time-critical monitoring. In many systems, however, higher accuracy comes at a higher energy cost. Olston et al. designed an adaptive bounded-width filter that trades precision for communication overhead [16]. Jain et al. tried to minimize resource usage under a precision requirement by designing a prediction system using a Dual Kalman Filter (DKF) [7]. As such, sophisticated prediction schemes can be incorporated in WSN to prevent unnecessary data transmission.

Techniques such as LEACH [5], TEEN [14], and APTEEN [15] use hierarchical clusters and routing to save energy. Pattem et al. studied the correlation between the spatial coherence of data and routing efficiency using lossless compression [17]. PREMON [4] and TiNA [18] are similar to CAG.
PREMON forms clusters based on a prediction model, while CAG forms clusters using real-time sensor values. TiNA exploits temporal correlation in sensor data, while CAG takes advantage of spatial correlation to form clusters. Deshpande et al. proposed a data acquisition method based on a statistical model [2]. Unlike CAG, their study does not take into account packet losses in the network, nor does it use clusters.

CAG exploits semantic broadcast [19] to reduce communication overhead by leveraging spatial correlation, a characteristic of the data distribution. CAG achieves efficient in-network storage and processing by unifying query routing (networking) and query processing (application) in a single mechanism. Instead of gathering and compressing all the data (a lossless algorithm), CAG generates a synopsis by filtering out insignificant elements in data streams (a lossy algorithm) to minimize response time, storage, computation, and communication costs.

Although environmental attributes such as temperature, light, and sound can be correlated over large distances, there has been no in-network aggregation algorithm that exploits spatially correlated sensor data while addressing both efficiency and precision. To the best of our knowledge, CAG is the first in-network aggregation algorithm exploiting spatial correlation, trading a negligible loss in result quality (precision) for a significant energy saving. CAG achieves this by focusing on a few representative values rather than a large amount of redundant data. With denser sensor deployment, there will be even higher data correlation, which increases CAG’s efficiency and precision. CAG is a lossy clustering algorithm because CAG uses