1130 IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 20, NO. 6, DECEMBER 2012

Fuzzy c-Means Algorithms for Very Large Data

Timothy C. Havens, Senior Member, IEEE, James C. Bezdek, Life Fellow, IEEE, Christopher Leckie, Lawrence O. Hall, Fellow, IEEE, and Marimuthu Palaniswami, Fellow, IEEE

Abstract—Very large (VL) data or big data are any data that you cannot load into your computer's working memory. This is not an objective definition, but a definition that is easy to understand and one that is practical, because there is a dataset too big for any computer you might use; hence, this is VL data for you. Clustering is one of the primary tasks used in the pattern recognition and data mining communities to search VL databases (including VL images) in various applications, and so, clustering algorithms that scale well to VL data are important and useful. This paper compares the efficacy of three different implementations of techniques aimed to extend fuzzy c-means (FCM) clustering to VL data. Specifically, we compare methods that are based on 1) sampling followed by noniterative extension; 2) incremental techniques that make one sequential pass through subsets of the data; and 3) kernelized versions of FCM that provide approximations based on sampling, including three proposed algorithms. We use both loadable and VL datasets to conduct the numerical experiments that facilitate comparisons based on time and space complexity, speed, quality of approximations to batch FCM (for loadable data), and assessment of matches between partitions and ground truth. Empirical results show that random sampling plus extension FCM, bit-reduced FCM, and approximate kernel FCM are good choices to approximate FCM for VL data. We conclude by demonstrating the VL algorithms on a dataset with 5 billion objects and presenting a set of recommendations regarding the use of different VL FCM clustering schemes.
Index Terms—Big data, fuzzy c-means (FCM), kernel methods, scalable clustering, very large (VL) data.

Manuscript received September 6, 2011; revised January 27, 2012; accepted April 18, 2012. Date of publication May 25, 2012; date of current version November 27, 2012. This work was supported in part by Grant #1U01CA143062-01, Radiomics of Non-Small Cell Lung Cancer, from the National Institutes of Health, and in part by the Michigan State University High Performance Computing Center and the Institute for Cyber Enabled Research. The work of T. C. Havens was supported by the National Science Foundation under Grant #1019343 to the Computing Research Association for the CI Fellows Project.

T. C. Havens is with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 USA (e-mail: havenst@gmail.com).
J. C. Bezdek was with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: jcbezdek@gmail.com).
C. Leckie is with the Department of Computer Science and Software Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: caleckie@csse.unimelb.edu.au).
L. O. Hall is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33630 USA (e-mail: hall@csee.usf.edu).
M. Palaniswami is with the Department of Electrical and Electronic Engineering, University of Melbourne, Parkville, Vic. 3010, Australia (e-mail: palani@unimelb.edu.au).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TFUZZ.2012.2201485

TABLE I
HUBER'S DESCRIPTION OF DATASET SIZES [11], [12]

I. INTRODUCTION

CLUSTERING or cluster analysis is a form of exploratory data analysis in which data are separated into groups or subsets such that the objects in each group share some similarity.
Clustering has been used as a preprocessing step to separate data into manageable parts [1], [2], as a knowledge discovery tool [3], [4], for indexing and compression [5], etc., and there are many good books that describe its various uses [6]–[10]. The most popular use of clustering is to assign labels to unlabeled data—data for which no preexisting grouping is known. Any field that uses or analyzes data can utilize clustering; the problem domains and applications of clustering are innumerable.

The ubiquity of personal computing technology, especially mobile computing, has produced an abundance of staggeringly large datasets—Facebook alone logs over 25 terabytes (TB) of data per day. Hence, there is a great need for clustering algorithms that can address these gigantic datasets. In 1996, Huber [11] classified dataset sizes as in Table I.^1 Bezdek and Hathaway [12] added the very large (VL) category to this table in 2006. Interestingly, data with more than 10^12 objects are still unloadable on most current (circa 2011) computers. For example, a dataset representing 10^12 objects, each with ten features, stored in short-integer (4 bytes) format would require 40 TB of storage (most high-performance computers have <1 TB of main memory). Hence, we believe that Table I will continue to be pertinent for many years.

There are two main approaches to clustering in VL data: distributed clustering, which is based on various incremental styles, and clustering a sample found by either progressive or random sampling. Each has been applied in the context of FCM clustering of VL data; these ideas can also be used for Gaussian-mixture-model (GMM) clustering with the expectation–maximization (EM) algorithm. Both approaches provide useful ways to accomplish two objectives: acceleration for loadable data and approximation for unloadable data.
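The 40-TB figure above is a back-of-envelope calculation; the short check below reproduces it (assuming decimal units, i.e., 1 TB = 10^12 bytes, as the example implies):

```python
# Storage needed for 10^12 objects, each with ten features stored as
# 4-byte values, as in the example above (decimal units: 1 TB = 10^12 bytes).
n_objects = 10**12
n_features = 10
bytes_per_value = 4

total_bytes = n_objects * n_features * bytes_per_value
print(total_bytes / 10**12, "TB")  # -> 40.0 TB
```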
Consider a set of n objects O = {o_1, ..., o_n}, e.g., human subjects in a psychological experiment, jazz clubs in Melbourne, or wireless sensor network nodes. Each object is typically represented by numerical feature-vector data that have the form X = {x_1, ..., x_n} ⊂ R^d, where the coordinates of x_i provide feature values (e.g., weight, length, cover charge, etc.) describing object o_i. A partition of the objects is defined as a set of cn values {u_ki}, where each value represents the degree to which object o_i is in the kth cluster. The c-partition is often represented

^1 Huber also defined tiny as 10^2 and small as 10^4.

1063-6706/$31.00 © 2012 IEEE
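The fuzzy c-partition {u_ki} just defined can be made concrete with a small sketch. The function below (a minimal pure-Python illustration, not the paper's implementation; the function and variable names are our own) computes memberships for given cluster prototypes via the standard FCM formula, u_ki = [Σ_j (d_ki / d_ji)^{1/(m−1)}]^{-1} on squared Euclidean distances, so that each object's memberships across the c clusters sum to 1:

```python
def fcm_memberships(X, V, m=2.0):
    """Fuzzy c-partition memberships u_ki for points X and prototypes V.

    X: list of n feature vectors (lists of floats); V: list of c prototypes.
    Returns U as a c x n list of lists; each column (one per object) sums
    to 1, per the standard FCM membership formula with fuzzifier m > 1.
    """
    def d2(a, b):
        # squared Euclidean distance, guarded against exact coincidence
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) or 1e-12

    exp = 1.0 / (m - 1.0)
    U = []
    for v_k in V:
        row = []
        for x_i in X:
            # u_ki = 1 / sum_j (d_ki / d_ji)^(1/(m-1))
            row.append(1.0 / sum((d2(x_i, v_k) / d2(x_i, v_j)) ** exp
                                 for v_j in V))
        U.append(row)
    return U


# Two tight groups and two matching prototypes: memberships are near-crisp,
# and every column of U sums to 1 by construction.
X = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]]
V = [[0.0, 0.1], [5.0, 5.0]]
U = fcm_memberships(X, V)
```

For loadable data, FCM alternates this membership update with a prototype update; the VL schemes compared in this paper approximate that loop when X cannot fit in memory.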