IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011 707
Distributed Clustering Using Wireless
Sensor Networks
Pedro A. Forero, Student Member, IEEE, Alfonso Cano, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE
Abstract—Clustering spatially distributed data is well moti-
vated and especially challenging when communication to a central
processing unit is discouraged, e.g., due to power constraints.
Distributed clustering schemes are developed in this paper for
both deterministic and probabilistic approaches to unsupervised
learning. The centralized problem is solved in a distributed fashion
by recasting it to a set of smaller local clustering problems with
consensus constraints on the cluster parameters. The resulting
iterative schemes do not exchange local data among nodes, and
rely only on single-hop communications. Performance of the novel
algorithms is illustrated with simulated tests on synthetic and real
sensor data. Surprisingly, these tests reveal that the distributed
algorithms can exhibit improved robustness to initialization relative to
their centralized counterparts.
Index Terms—Clustering methods, distributed algorithms, ex-
pectation–maximization (EM) algorithms, iterative methods, wire-
less sensor networks.
I. INTRODUCTION
THE development of small, low-cost, intelligent sensors
with communication capabilities has prompted the emer-
gence of wireless sensor networks (WSNs) in applications in-
cluding environmental monitoring, surveillance, tracking, and
inference tasks in bio-informatics [20], [25], [26]. When using
a WSN as an exploratory infrastructure, it is often desired to
infer hidden structures in distributed data collected by the sen-
sors. With each sensor having available a set of unlabeled ob-
servations drawn from a known number of classes, the goal of
the present paper is to design local clustering rules that perform
at least as well as global ones, which rely on all observations
being centrally available. Because low-cost sensors must op-
erate under stringent power constraints, transmitting all obser-
vations to a central location may be infeasible. This motivates
Manuscript received June 01, 2010; revised November 15, 2010; accepted
January 25, 2011. Date of publication February 14, 2011; date of current ver-
sion July 20, 2011. This work was supported in part by the National Science
Foundation (NSF) under Grants CCF 0830480 and CON 014658 and also in
part through collaborative participation in the Communications and Networks
Consortium sponsored by the U.S. Army Research Laboratory under the Collab-
orative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-
0011. The U.S. Government is authorized to reproduce and distribute reprints
for Government purposes notwithstanding any copyright notation thereon. The
views and conclusions contained in this document are those of the authors and
should not be interpreted as representing the official policies of the Army Re-
search Laboratory or the U.S. Government. The associate editor coordinating
the review of this manuscript and approving it for publication was Prof. Anna
Scaglione.
The authors are with the Department of Electrical and Computer Engi-
neering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail:
forer002@umn.edu; alfonso@umn.edu; georgios@umn.edu).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSTSP.2011.2114324
looking for in-network clustering algorithms requiring informa-
tion exchanges among single-hop neighbors only.
Focus is placed on partitional (as opposed to hierarchical)
clustering algorithms, which yield a single partitioning of the
data described by a fixed number of parameters [30]. With
fewer parameters than available data, partitional
clustering can afford parsimonious distributed implementa-
tions of deterministic and probabilistic approaches. A popular
centralized deterministic partitional clustering approach is
offered by the K-means algorithm, which features simple and
fast-convergent iterations [19]. Alternatively, clustering can
be viewed as the byproduct of a density estimation problem
by introducing a parametric probabilistic model governing the
data generation; e.g., a Gaussian mixture model (GMM) [9,
Ch. 10]. Density estimation problems are of further interest in
the clustering context, because they provide extra information
in the form of confidence levels for data-to-cluster assignments.
When the sought density is described by a finite number of
parameters, a popular scheme for estimating them using the
maximum-likelihood (ML) approach is the centralized ex-
pectation–maximization (EM) algorithm. The EM algorithm
has well-documented merits because it is computationally
affordable, and offers convergence guarantees [7].
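For concreteness, the centralized K-means iteration referred to above can be sketched as follows. This is a minimal illustration of Lloyd's algorithm on toy two-dimensional points, not code from the paper; the function name and data are ours:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, K, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment
    and per-cluster mean updates until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, K)  # initialize from K distinct points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(K), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its cluster.
        new = []
        for k in range(K):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new.append(tuple(sum(c) / len(members)
                                 for c in zip(*members)))
            else:
                new.append(centroids[k])  # keep an empty cluster's centroid
        if new == centroids:  # assignments stabilized; exact fixed point
            break
        centroids = new
    return centroids, labels
```

The two alternating steps each monotonically decrease the classification cost, which is why the iterations converge quickly; the distributed schemes developed later must reproduce this behavior without gathering all points at one node.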
Parallel and distributed implementations of the K-means
(DKM) and EM (DEM) algorithms have arisen most often
because of the need to deal with large data sets. However,
most existing schemes are agnostic to the network communi-
cation constraints [8], [22], [31]. In the WSN context, various
probabilistic approaches have been reported leading to: an in-
cremental (I-) DEM scheme [23]; a gossip-based scheme [18]; a
scheme based on consensus averaging [14]; a scheme based on
junction trees and related topologies [29]; and a scheme based
on the alternating direction method of multipliers [12]. Except
for [12] and [29], all these works are confined to parameter
estimation when the data probability density function (pdf) is
modeled as a finite mixture of Gaussian density functions, a
case where local estimators are available in closed form. In
addition, [23] and [29] are confined to specific communication
network topologies (loops or trees).
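To illustrate the consensus-averaging primitive that several of the cited schemes build on, the following is a generic sketch of synchronous consensus iterations over an arbitrary connected graph; it is not taken from any of the referenced works, and the step size bound stated in the docstring is a standard sufficient condition:

```python
def consensus_average(values, neighbors, step=0.3, n_iter=300):
    """Iterative consensus: each node repeatedly nudges its local value
    toward those of its one-hop neighbors.  On a connected graph, with a
    step size below 1/max_degree, all nodes converge to the global average
    using only single-hop exchanges."""
    x = list(map(float, values))
    for _ in range(n_iter):
        # Synchronous update: every node mixes in its neighbors' values.
        x = [x[i] + step * sum(x[j] - x[i] for j in neighbors[i])
             for i in range(len(x))]
    return x
```

Because each update only moves value differences along edges, the network-wide sum (hence the average) is preserved at every iteration, which is the property consensus-based DEM variants exploit to compute global sufficient statistics without a fusion center.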
This paper presents and analyzes novel distributed algorithms
for clustering observations collected by spatially distributed re-
source-aware sensors, which exchange only sufficient informa-
tion with their one-hop neighbors. Viewing first the data as de-
terministic, a distributed version of the centralized K-means algorithm is developed. On par with the centralized K-means algorithm, the novel DKM scheme iterates over the variables of
a consensus-based decentralized version of the global classifi-
cation cost. Subsequently, viewing the data as random draws
from a probabilistic model, the underlying data pdf is modeled