Fast unsupervised learning method for rapid estimation of cluster centroids

Mitchell Yuwono, Steven W. Su, Bruce Moulton, Hung Nguyen
Centre for Health Technologies
University of Technology, Sydney
Sydney, Australia
mitchellyuwono@gmail.com

Abstract: Data clustering is a process where a set of data points is divided into groups of similar points. Recent approaches for data clustering have seen the development of unsupervised learning algorithms based on Particle Swarm Optimization (PSO) techniques. These include the Particle Swarm Clustering (PSC) and Modified PSC (mPSC) algorithms for solving clustering problems. However, the PSC and mPSC algorithms tend to be computationally expensive when applied to datasets that have high dimensionality and large volume. This paper presents a novel and more efficient swarm clustering strategy we call Rapid Centroid Estimation (RCE). We compare the performance of RCE with the performance of PSC and mPSC in several ways, including complexity analyses and particle behavior analyses. Our benchmark testing suggests that RCE can reach a solution 274 times faster than PSC and 270 times faster than mPSC for a clustering task where the dataset has a dimension of 80 and a volume of 500. We also investigated particle behaviors on two-class, two-dimensional datasets with a volume of 500, containing 250 data points for each well-separated class with known Gaussian centers. We found that RCE converged to the appropriate centers at 70 updates on average, compared to 19802 updates for PSC and 23006 updates for mPSC. An ANOVA indicates RCE is significantly faster than both PSC and mPSC.

Keywords: Particle Swarm Optimization; Clustering; Centroid estimation; Statistical Analysis; Complexity Analysis.

I. INTRODUCTION

Clustering can be viewed as an exploratory data analysis tool. It can enhance understanding of the data by organizing it into meaningful subsets.
A good cluster is characterized by high intra-cluster similarity and low inter-cluster similarity. Similarity can be assessed using various measures such as Euclidean distance, Mahalanobis distance, cosine similarity, and Pearson correlation [1].

Clustering is a key tool for analyzing complicated datasets when data is recorded from multiple sources in an uncontrolled environment. Clustering methods are often used for preprocessing such data [1]. However, in cases where the data has high dimensionality and large volume, existing clustering methods tend to become very computationally demanding.

Particle swarm optimization (PSO) is a stochastic optimization approach originally proposed by Kennedy & Eberhart in 1995. It was inspired by the behavior of flocks of birds and schools of fish [2]. Data clustering using particle swarm optimization was first proposed by Van Der Merwe & Engelbrecht in 2003 with promising results [3].

Particle Swarm Clustering (PSC), a PSO algorithm specially designed to optimize clustering problems, was proposed by Cohen & de Castro in 2006 [4]. Inspired by the social interaction of humans in a global neighborhood, PSC organizes data points into clusters based on the interdependence of each particle. Cohen & de Castro showed that PSC is superior to K-means on benchmark datasets [4].

A modified PSC algorithm called Modified PSC (mPSC) was proposed by Szabo in 2010 [5]. On the assumption that the term velocity is not appropriate in the context of a social neighborhood, mPSC eliminates the need for velocity and inertia weight during the update procedure. The algorithm was reported to reduce computation time while preserving cluster quality. It was conceded, however, that mPSC suffers from a long optimization time similar to that of its predecessor.

After analyzing and reviewing the original PSC and mPSC algorithms, we propose an altered lightweight PSC-type algorithm we call Rapid Centroid Estimation (RCE).
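For readers unfamiliar with the velocity and inertia weight that mPSC removes, the following is a minimal sketch of the textbook inertia-weight PSO update. It is not the PSC, mPSC, or RCE update rule. The function names (`pso_step`, `minimize`, `sphere`), the parameter values (`inertia=0.7`, `c1=c2=1.5`), and the toy objective are assumptions chosen for illustration only.

```python
import random


def pso_step(positions, velocities, personal_best, global_best,
             inertia=0.7, c1=1.5, c2=1.5, rng=random):
    """Apply one textbook PSO update to every particle, in place.

    v <- inertia * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x <- x + v
    """
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            # Fresh random weights for each particle and each dimension.
            r1, r2 = rng.random(), rng.random()
            v[d] = (inertia * v[d]
                    + c1 * r1 * (personal_best[i][d] - x[d])
                    + c2 * r2 * (global_best[d] - x[d]))
            x[d] += v[d]


def sphere(x):
    """Toy objective (assumed for illustration): squared distance to the origin."""
    return sum(xi * xi for xi in x)


def minimize(objective, dim=2, n_particles=10, n_iter=100, seed=0):
    """Run the update loop; each particle here is a full candidate solution."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-5, 5) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    global_best = min(personal_best, key=objective)[:]
    for _ in range(n_iter):
        pso_step(positions, velocities, personal_best, global_best, rng=rng)
        # Keep the best position each particle has visited so far.
        for i, p in enumerate(positions):
            if objective(p) < objective(personal_best[i]):
                personal_best[i] = p[:]
        global_best = min(personal_best, key=objective)[:]
    return global_best
```

Calling `minimize(sphere)` returns the best position found by the swarm. The inertia term carries part of the previous velocity into the next step, which is the quantity mPSC is described as dropping from its update.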
We investigate the iteration time of the algorithm using generated Gaussian datasets with dimensions of 1 to 80 and volumes of 10 to 500. The behaviors of the proposed RCE particles are investigated using two synthetically generated datasets with known Gaussian centers.

The paper is organized as follows. Section I provides an introductory background to clustering, PSO, PSC, mPSC and RCE. Section II presents an overview of the PSC and mPSC algorithms. Section III proposes our algorithm, RCE. Analyses of time complexity and iteration times are presented in Section IV. A comparative study of the algorithms and an analysis of particle behaviors are given in Section V. Conclusions and future research directions are given in Section VI.

II. OVERVIEW OF THE PSC AND MPSC ALGORITHMS

A. Particle Swarm Clustering (PSC)

According to [4], Particle Swarm Clustering (PSC) can be viewed as a special modification of PSO devised specifically for clustering tasks. This is in contrast to the general implementation of PSO, where each particle represents a candidate solution. In PSC, each particle represents only a fraction of a solution: a cluster centroid prototype. It follows

U.S. Government work not protected by U.S. copyright. WCCI 2012 IEEE World Congress on Computational Intelligence, June 10-15, 2012, Brisbane, Australia. IEEE CEC