1130 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 20, NO. 6, DECEMBER 2012
Fuzzy c-Means Algorithms for Very Large Data
Timothy C. Havens, Senior Member, IEEE, James C. Bezdek, Life Fellow, IEEE, Christopher Leckie,
Lawrence O. Hall, Fellow, IEEE, and Marimuthu Palaniswami, Fellow, IEEE
Abstract—Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, and so, clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct the numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.

Index Terms—Big data, fuzzy c-means (FCM), kernel methods, scalable clustering, very large (VL) data.
I. INTRODUCTION
CLUSTERING or cluster analysis is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share some similarity.

Manuscript received September 6, 2011; revised January 27, 2012; accepted April 18, 2012. Date of publication May 25, 2012; date of current version November 27, 2012. This work was supported in part by Grant #1U01CA143062-01, Radiomics of Non-Small Cell Lung Cancer, from the National Institutes of Health, and in part by the Michigan State University High Performance Computing Center and the Institute for Cyber Enabled Research. The work of T. C. Havens was supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CI Fellows Project.

T. C. Havens is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA (e-mail: havenst@gmail.com).
J. C. Bezdek was with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: jcbezdek@gmail.com).
C. Leckie is with the Department of Computer Science and Software Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: caleckie@csse.unimelb.edu.au).
L. O. Hall is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33630 USA (e-mail: hall@csee.usf.edu).
M. Palaniswami is with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: palani@unimelb.edu.au).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TFUZZ.2012.2201485

TABLE I
HUBER'S DESCRIPTION OF DATASET SIZES [11], [12]
Clustering has been used as a preprocessing step to separate data
into manageable parts [1], [2], as a knowledge discovery tool [3],
[4], for indexing and compression [5], etc., and there are many
good books that describe its various uses [6]–[10]. The most
popular use of clustering is to assign labels to unlabeled data—
data for which no preexisting grouping is known. Any field
that uses or analyzes data can utilize clustering; the problem
domains and applications of clustering are innumerable.
The ubiquity of personal computing technology, especially
mobile computing, has produced an abundance of staggeringly
large datasets—Facebook alone logs over 25 terabytes (TB) of
data per day. Hence, there is a great need for clustering algorithms that can address these gigantic datasets. In 1996, Huber [11] classified dataset sizes as in Table I.¹ Bezdek and Hathaway [12] added the very large (VL) category to this table in 2006. Interestingly, data with 10^>12 objects are still unloadable on most current (circa 2011) computers. For example, a dataset representing 10^12 objects, each with ten features, stored in short-integer (4 bytes) format would require 40 TB of storage (most high-performance computers have <1 TB of main memory). Hence, we believe that Table I will continue to be pertinent for many years.
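As a quick check of the storage arithmetic above (our own back-of-the-envelope sketch, not from the paper):

```python
# Storage estimate for the VL example: 10^12 objects,
# 10 features each, 4 bytes per stored value.
n_objects = 10**12
n_features = 10
bytes_per_value = 4

total_bytes = n_objects * n_features * bytes_per_value
total_tb = total_bytes / 10**12  # decimal terabytes

print(f"{total_tb:.0f} TB")  # -> 40 TB
```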
There are two main approaches to clustering in VL data: distributed clustering, which is based on various incremental styles, and clustering a sample found by either progressive or random sampling. Each has been applied in the context of FCM clustering of VL data; these ideas can also be used for Gaussian-mixture-model (GMM) clustering with the expectation–maximization (EM) algorithm. Both approaches provide useful ways to accomplish two objectives: acceleration for loadable data and approximation for unloadable data.
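The sampling route can be sketched as follows: run literal FCM on a random sample, then extend the result to the full dataset in a single noniterative pass using the standard FCM membership formula. This is a minimal NumPy sketch under our own assumptions (function names and defaults are ours, not the paper's):

```python
import numpy as np

def memberships(X, V, m=2.0):
    """FCM membership of every point in X to prototypes V (returns c x n)."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + 1e-12  # n x c squared distances
    inv = d2 ** (-1.0 / (m - 1.0))
    return (inv / inv.sum(axis=1, keepdims=True)).T

def fcm(X, c, m=2.0, n_iter=50, seed=0):
    """Plain batch FCM; returns the c x d prototype matrix V."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                                  # columns sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)    # prototype update
        U = memberships(X, V, m)                        # membership update
    return V

def sample_and_extend_fcm(X, c, n_sample=1000, seed=0):
    """Scheme 1): cluster a random sample, then extend memberships
    to all of X in one noniterative pass."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_sample, len(X)), replace=False)
    V = fcm(X[idx], c)               # literal FCM on the loadable sample
    U_full = memberships(X, V)       # extension: one pass over all data
    return U_full, V
```

The extension step touches each object once, which is what makes the scheme attractive when X itself is too large to iterate over repeatedly.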
Consider a set of n objects O = {o_1, ..., o_n}, e.g., human subjects in a psychological experiment, jazz clubs in Melbourne, or wireless sensor network nodes. Each object is typically represented by numerical feature-vector data that have the form X = {x_1, ..., x_n} ⊂ R^d, where the coordinates of x_i provide feature values (e.g., weight, length, cover charge, etc.) describing object o_i.
A partition of the objects is defined as a set of cn values {u_ki}, where each value represents the degree to which object o_i is in the kth cluster. The c-partition is often represented
¹Huber also defined tiny as 10^2 and small as 10^4.
1063-6706/$31.00 © 2012 IEEE
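For concreteness, the cn membership values {u_ki} can be held in a c × n array whose columns are nonnegative and sum to 1; a crisp (hard) partition is the special case where every u_ki is 0 or 1. A small illustrative sketch with invented numbers:

```python
import numpy as np

# A fuzzy 2-partition of n = 4 objects: u_ki is the degree to which
# object o_i belongs to cluster k.  Each column sums to 1.
U = np.array([
    [1.0, 0.8, 0.5, 0.1],   # memberships in cluster 1
    [0.0, 0.2, 0.5, 0.9],   # memberships in cluster 2
])
assert np.allclose(U.sum(axis=0), 1.0)   # fuzzy partition constraint

# Hardening: assign each object to its maximum-membership cluster,
# yielding a crisp partition with entries in {0, 1}.
hard = np.eye(2)[U.argmax(axis=0)].T
```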