Intelligent Data Analysis 15 (2011) 3–28
DOI 10.3233/IDA-2010-0453
IOS Press

Clustering distributed sensor data streams using local processing and reduced communication

João Gama a,d,*, Pedro Pereira Rodrigues a,b,c and Luís Lopes b,e

a LIAAD, University of Porto, Porto, Portugal
b Faculty of Sciences, University of Porto, Porto, Portugal
c Faculty of Medicine, University of Porto, Porto, Portugal
d Faculty of Economics, University of Porto, Porto, Portugal
e CRACS – INESC, Porto, Portugal

Abstract. Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream. In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the data stream at the local site. Whenever a local site changes its state, it notifies the central server about the new state it is in. This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent states in order to provide an anytime definition of the cluster centers.
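The data flow described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the DGClust implementation: each local site uses a fixed equal-width grid over a known range (the paper's discretization is online), the central site bounds its counters with a Space-Saving-style replacement rule chosen here for illustration, and the final clustering step over the frequent states is omitted. The names `LocalSite`, `CentralSite` and `run` are hypothetical.

```python
class LocalSite:
    """One sensor with a univariate grid (assumed fixed range [lo, hi))."""

    def __init__(self, lo, hi, n_cells):
        self.lo, self.hi, self.n_cells = lo, hi, n_cells
        self.state = None  # index of the cell triggered by the last reading

    def cell(self, x):
        width = (self.hi - self.lo) / self.n_cells
        idx = int((x - self.lo) // width)
        return min(max(idx, 0), self.n_cells - 1)  # clamp out-of-range values

    def update(self, x):
        """Return the new cell index if the state changed, else None."""
        new_state = self.cell(x)
        if new_state != self.state:
            self.state = new_state
            return new_state
        return None


class CentralSite:
    """Tracks the global state and a bounded list of frequent-state counters."""

    def __init__(self, n_sensors, max_counters):
        self.global_state = [None] * n_sensors
        self.max_counters = max_counters
        self.counts = {}  # global state (tuple of cells) -> approximate count

    def notify(self, sensor_id, cell):
        self.global_state[sensor_id] = cell

    def tick(self):
        """Count the current global state once per time step."""
        if None in self.global_state:
            return  # not every sensor has reported yet
        key = tuple(self.global_state)
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.max_counters:
            self.counts[key] = 1
        else:
            # Assumed Space-Saving-style rule: replace the least frequent
            # state and inherit its count plus one.
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1


def run(streams, lo, hi, n_cells, max_counters):
    """Feed aligned streams through the sites; return (central, messages)."""
    sites = [LocalSite(lo, hi, n_cells) for _ in streams]
    central = CentralSite(len(streams), max_counters)
    messages = 0
    for readings in zip(*streams):
        for i, x in enumerate(readings):
            changed = sites[i].update(x)
            if changed is not None:  # communicate only on a state change
                central.notify(i, changed)
                messages += 1
        central.tick()
    return central, messages
```

Calling `run` on a few aligned streams returns the central site and the number of messages sent, which can be compared with the total number of readings to see how much communication the state-change rule avoids.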
The approach is evaluated in the context of distributed sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The experimental work on synthetic data supports our proposal, showing robustness to a high number of sensors, and the application to real data from physiological sensors illustrates the aforementioned advantages of the system.

Keywords: Online adaptive clustering, distributed data streams, sensor networks, incremental discretization, frequent items monitoring

* Corresponding author: João Gama, LIAAD – INESC Porto L.A., Rua de Ceuta, 118-6 andar, 4050-190, Porto, Portugal. E-mail: jgama@fep.up.pt.

1088-467X/11/$27.50 © 2011 – IOS Press and the authors. All rights reserved

1. Introduction

Data gathering and analysis have become ubiquitous, in the sense that our world is evolving into a setting where all devices, as small as they may be, will be able to include sensing and processing abilities. Nowadays, applications produce infinite streams of data distributed across wide sensor networks (see the example in Fig. 1). The aim of the analysis addressed in this work is to continuously maintain a cluster structure over the data points generated by the entire network, as clustering of sensor data