IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 5, NO. 4, AUGUST 2011

Distributed Clustering Using Wireless Sensor Networks

Pedro A. Forero, Student Member, IEEE, Alfonso Cano, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE

Abstract—Clustering spatially distributed data is well motivated and especially challenging when communication to a central processing unit is discouraged, e.g., due to power constraints. Distributed clustering schemes are developed in this paper for both deterministic and probabilistic approaches to unsupervised learning. The centralized problem is solved in a distributed fashion by recasting it to a set of smaller local clustering problems with consensus constraints on the cluster parameters. The resulting iterative schemes do not exchange local data among nodes, and rely only on single-hop communications. Performance of the novel algorithms is illustrated with simulated tests on synthetic and real sensor data. Surprisingly, these tests reveal that the distributed algorithms can exhibit greater robustness to initialization than their centralized counterparts.

Index Terms—Clustering methods, distributed algorithms, expectation–maximization (EM) algorithms, iterative methods, wireless sensor networks.

I. INTRODUCTION

The development of small, low-cost, intelligent sensors with communication capabilities has prompted the emergence of wireless sensor networks (WSNs) in applications including environmental monitoring, surveillance, tracking, and inference tasks in bio-informatics [20], [25], [26]. When using a WSN as an exploratory infrastructure, it is often desired to infer hidden structures in distributed data collected by the sensors. With each sensor having available a set of unlabeled observations drawn from a known number of classes, the goal of the present paper is to design local clustering rules that perform at least as well as global ones, which rely on all observations being centrally available.
Because low-cost sensors must operate under stringent power constraints, transmitting all observations to a central location may be infeasible. This motivates looking for in-network clustering algorithms requiring information exchanges among single-hop neighbors only.

Focus is placed on partitional (as opposed to hierarchical) clustering algorithms, which yield a single partitioning of the data described by a fixed number of parameters [30]. With these parameters being fewer than the available data, partitional clustering can afford parsimonious distributed implementations of deterministic and probabilistic approaches.

Manuscript received June 01, 2010; revised November 15, 2010; accepted January 25, 2011. Date of publication February 14, 2011; date of current version July 20, 2011. This work was supported in part by the National Science Foundation (NSF) under Grants CCF 0830480 and CON 014658 and also in part through collaborative participation in the Communications and Networks Consortium sponsored by the U.S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies of the Army Research Laboratory or the U.S. Government. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Anna Scaglione. The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: forer002@umn.edu; alfonso@umn.edu; georgios@umn.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTSP.2011.2114324
A popular centralized deterministic partitional clustering approach is offered by the K-means algorithm, which features simple, fast-convergent iterations [19]. Alternatively, clustering can be viewed as the byproduct of a density estimation problem by introducing a parametric probabilistic model governing the data generation, e.g., a Gaussian mixture model (GMM) [9, Ch. 10]. Density estimation problems are of further interest in the clustering context because they provide extra information in the form of confidence on the data-to-cluster assignments. When the sought density is described by a finite number of parameters, a popular scheme for estimating them via the maximum-likelihood (ML) approach is the centralized expectation–maximization (EM) algorithm. The EM algorithm has well-documented merits: it is computationally affordable and offers convergence guarantees [7].

Parallel and distributed implementations of the K-means (DKM) and EM (DEM) algorithms have arisen most often from the need to deal with large data sets. However, most existing schemes are agnostic to the network communication constraints [8], [22], [31]. In the WSN context, various probabilistic approaches have been reported, leading to: an incremental (I-)DEM scheme [23]; a gossip-based scheme [18]; a scheme based on consensus averaging [14]; a scheme based on junction trees and related topologies [29]; and a scheme based on the alternating direction method of multipliers [12]. Except for [12] and [29], all of these works are confined to parameter estimation when the data probability density function (pdf) is modeled as a finite mixture of Gaussian density functions, a case where local estimators are available in closed form. In addition, [23] and [29] are confined to specific communication network topologies (loops or trees).
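For concreteness, the centralized EM iteration for a GMM alternates a posterior-responsibility (E) step with a parameter re-estimation (M) step. The following is a minimal illustrative sketch for a one-dimensional GMM, not the paper's algorithm; the quantile-based initialization and fixed iteration count are simplifying assumptions.

```python
import numpy as np

def em_gmm(x, K, iters=100):
    """Centralized EM for a 1-D Gaussian mixture (illustrative sketch)."""
    n = len(x)
    w = np.full(K, 1.0 / K)                          # mixing weights
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)    # spread-out initial means
    var = np.full(K, x.var())                        # common initial variance
    for _ in range(iters):
        # E-step: responsibility of component k for sample i.
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var)) * w
        r = dens / dens.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = r.sum(0)
        w = nk / n
        mu = (r * x[:, None]).sum(0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(0) / nk
    return w, mu, var
```

Each iteration monotonically increases the data likelihood, which underlies the convergence guarantees cited above; in practice one would stop when the likelihood improvement falls below a tolerance rather than after a fixed number of iterations.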
This paper presents and analyzes novel distributed algorithms for clustering observations collected by spatially distributed, resource-aware sensors, which exchange only sufficient information with their one-hop neighbors. Viewing the data first as deterministic, a distributed version of the centralized K-means algorithm is developed. On a par with the centralized K-means algorithm, the novel DKM scheme iterates over the variables of a consensus-based decentralized version of the global classification cost. Subsequently, viewing the data as random draws from a probabilistic model, the underlying data pdf is modeled

1932-4553/$26.00 © 2011 IEEE
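The consensus idea behind such a scheme can be illustrated with a simplified stand-in, not the paper's exact algorithm: each node alternates local K-means updates on its own data with an averaging step over its one-hop neighborhood. The network model, the common initialization (which keeps centroid indices aligned across nodes), and the equal averaging weights below are all simplifying assumptions for illustration.

```python
import numpy as np

def distributed_kmeans(local_data, neighbors, K, iters=50, seed=0):
    """Consensus-flavored distributed K-means sketch.

    local_data: list of (n_j, p) arrays, one per node.
    neighbors:  list of single-hop neighbor-index lists, one per node.
    """
    rng = np.random.default_rng(seed)
    J = len(local_data)
    # Common initialization so centroid index k refers to the same
    # cluster at every node (a simplifying assumption of this sketch).
    base = local_data[0]
    common = base[rng.choice(len(base), K, replace=False)]
    cents = [common.copy() for _ in range(J)]
    for _ in range(iters):
        # Local step: one classical K-means update on each node's own data.
        for j, X in enumerate(local_data):
            d = ((X[:, None, :] - cents[j][None, :, :]) ** 2).sum(-1)
            lab = d.argmin(1)
            for k in range(K):
                if np.any(lab == k):
                    cents[j][k] = X[lab == k].mean(0)
        # Consensus step: average centroid estimates over one-hop neighbors;
        # no raw observations are exchanged, only the K centroid vectors.
        cents = [np.mean([cents[i] for i in [j] + neighbors[j]], axis=0)
                 for j in range(J)]
    return cents
```

Note the communication pattern matching the paper's setting: each node transmits only its K centroid estimates to its single-hop neighbors, never its local observations.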