Intelligent Data Analysis 15 (2011) 3–28
DOI 10.3233/IDA-2010-0453
IOS Press
Clustering distributed sensor data streams using local processing and reduced communication
João Gama (a,d,∗), Pedro Pereira Rodrigues (a,b,c) and Luís Lopes (b,e)
(a) LIAAD, University of Porto, Porto, Portugal
(b) Faculty of Sciences, University of Porto, Porto, Portugal
(c) Faculty of Medicine, University of Porto, Porto, Portugal
(d) Faculty of Economics, University of Porto, Porto, Portugal
(e) CRACS – INESC, Porto, Portugal
Abstract. Nowadays, applications produce infinite streams of data distributed across wide sensor networks. In this work we
study the problem of continuously maintaining a cluster structure over the data points generated by the entire network. Usual
techniques operate by forwarding and concentrating the entire data in a central server, processing it as a multivariate stream.
In this paper, we propose DGClust, a new distributed algorithm which reduces both the dimensionality and the communication
burdens by allowing each local sensor to keep an online discretization of its data stream, which operates with constant update
time and (almost) fixed space. Each new data point triggers a cell in this univariate grid, reflecting the current state of the
data stream at the local site. Whenever a local site changes its state, it notifies the central server of its new state.
This way, at each point in time, the central site has the global multivariate state of the entire network. To avoid monitoring all
possible states, whose number is exponential in the number of sensors, the central site keeps a small list of counters of the most
frequent global states. Finally, a simple adaptive partitional clustering algorithm is applied to the central points of the frequent
states in order to provide an anytime definition of the cluster centers. The approach is evaluated in the context of distributed
sensor networks, focusing on three outcomes: loss to real centroids, communication prevention, and processing reduction. The
experimental work on synthetic data supports our proposal, showing robustness to a high number of sensors, and the application
to real data from physiological sensors exposes the aforementioned advantages of the system.
Keywords: Online adaptive clustering, distributed data streams, sensor networks, incremental discretization, frequent items
monitoring
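The message flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the DGClust implementation: the fixed equal-width grid, the unbounded counter of global states, and all names (`LocalSite`, `CentralSite`, `simulate`) are assumptions made for the example. The abstract describes an online discretization and a small, bounded list of counters, and the final clustering step is omitted here.

```python
from collections import Counter


class LocalSite:
    """Hypothetical sensor holding a fixed equal-width univariate grid."""

    def __init__(self, low, high, n_cells):
        self.low, self.high, self.n_cells = low, high, n_cells
        self.state = None  # index of the last triggered cell

    def cell_of(self, x):
        # Map a reading to a cell index, clamping out-of-range values.
        width = (self.high - self.low) / self.n_cells
        idx = int((x - self.low) / width)
        return min(max(idx, 0), self.n_cells - 1)

    def update(self, x):
        """Return the new cell index if the local state changed, else None."""
        cell = self.cell_of(x)
        if cell != self.state:
            self.state = cell
            return cell
        return None


class CentralSite:
    """Hypothetical server tracking the global state and its frequency."""

    def __init__(self, n_sensors):
        self.global_state = [None] * n_sensors
        self.counts = Counter()  # assumption: unbounded, unlike a small counter list
        self.messages = 0        # notifications actually received

    def notify(self, sensor_id, cell):
        self.global_state[sensor_id] = cell
        self.messages += 1

    def tick(self):
        # Count the current global state once every sensor has reported.
        if None not in self.global_state:
            self.counts[tuple(self.global_state)] += 1


def simulate(readings, low=0.0, high=10.0, n_cells=5):
    """Feed rows of simultaneous readings (one value per sensor)."""
    n_sensors = len(readings[0])
    sites = [LocalSite(low, high, n_cells) for _ in range(n_sensors)]
    central = CentralSite(n_sensors)
    for row in readings:
        for sensor_id, x in enumerate(row):
            cell = sites[sensor_id].update(x)
            if cell is not None:  # communicate only on a state change
                central.notify(sensor_id, cell)
        central.tick()
    return central
```

With two sensors and the four rows `[(1.0, 9.0), (1.5, 9.5), (5.0, 9.2), (5.1, 9.9)]`, the eight raw readings produce only three notifications in this sketch, since a sensor stays silent while its readings remain in the same cell.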
1. Introduction
Data gathering and analysis have become ubiquitous, in the sense that our world is evolving into
a setting where all devices, as small as they may be, will be able to include sensing and processing
ability. Nowadays, applications produce infinite streams of data distributed across wide sensor networks
(see the example in Fig. 1). The aim of the analysis addressed in this work is to continuously maintain
a cluster structure over the data points generated by the entire network, as clustering of sensor data
∗ Corresponding author: João Gama, LIAAD – INESC Porto L.A., Rua de Ceuta, 118-6 andar, 4050-190, Porto, Portugal.
E-mail: jgama@fep.up.pt.
1088-467X/11/$27.50 © 2011 – IOS Press and the authors. All rights reserved