International Journal of Research and Reviews in Soft and Intelligent Computing (IJRRSIC) Vol. 2, No. 3, September 2012, ISSN: 2046-6412
© Science Academy Publisher, United Kingdom
www.sciacademypublisher.com/journals/index.php/IJRRSIC

Restarted Simulated Annealing Particle Swarm Optimization used in Cluster Analysis

Yudong Zhang and Lenan Wu
School of Information Science and Engineering, Southeast University, Nanjing, China
Email: zhangyudongnuaa@gmail.com, wuln@seu.edu.cn

Abstract – In order to solve the cluster analysis problem more efficiently, we present a new approach based on restarted simulated annealing particle swarm optimization (RSAPSO). First, we created the optimization model using the variance ratio criterion (VRC) as the fitness function. Second, RSAPSO was introduced to find the maximal point of the VRC. The experimental dataset contained 400 data points in 4 groups with three different degrees of overlap: non-overlapping, partially overlapping, and severely overlapping. We compared RSAPSO with the genetic algorithm (GA) and combinatorial particle swarm optimization (CPSO). Each algorithm was run 20 times. The results showed that RSAPSO found the largest VRC values among the three algorithms while requiring the least time. We conclude that RSAPSO is effective and rapid for the cluster analysis problem.

Keywords – Cluster Analysis, Variance Ratio Criterion, Genetic Algorithm, Particle Swarm Optimization, Sequence Quadratic Programming

1. Introduction

Cluster analysis is the assignment of a set of observations into subsets, without any a priori knowledge, so that observations in the same cluster are more similar to each other than to those in other clusters [1]. Clustering is a method of unsupervised learning and a common technique for statistical data analysis used in many fields [2], including machine learning [3], data mining [4], pattern recognition [5], image analysis [6] and bioinformatics [7].
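To make the fitness function named in the abstract concrete, the following is a minimal sketch of the variance ratio criterion, assuming its standard (Calinski-Harabasz) definition: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k). The function name `variance_ratio_criterion` and its helpers are illustrative choices, not taken from the paper, and the formulation used by the authors may differ in detail.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def _mean(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def _sqdist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_ratio_criterion(points: Sequence[Point],
                             labels: Sequence[Hashable]) -> float:
    """VRC under the standard Calinski-Harabasz definition (an assumption):
    (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))."""
    n = len(points)
    clusters: Dict[Hashable, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("VRC requires 2 <= k < n")

    overall = _mean(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        centroid = _mean(members)
        between += len(members) * _sqdist(centroid, overall)
        within += sum(_sqdist(p, centroid) for p in members)

    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


# Toy data: two tight groups far apart along the first coordinate.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = variance_ratio_criterion(data, [0, 0, 1, 1])  # natural grouping
bad = variance_ratio_criterion(data, [0, 1, 0, 1])   # groups mixed up
print(good, bad)
```

On this toy data the natural grouping scores far higher than the mixed one, which is why a larger VRC is read as a better partition and why an optimizer can search for the labeling that maximizes it.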
Cluster analysis can be achieved by various algorithms that differ significantly. These methods can be broadly classified into four categories:
1) Hierarchical Methods. They find successive clusters using previously established clusters. They can be further divided into agglomerative methods and divisive methods [8]. Agglomerative algorithms start with one-point clusters and recursively merge the two or more most appropriate clusters [9]. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters [10].
2) Partition Methods. They generate a single partition of the data with a specified or estimated number of non-overlapping clusters, in an attempt to recover the natural groups present in the data [11].
3) Density-based Methods. They are devised to discover arbitrarily shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. DBSCAN [12] is the typical algorithm of this kind.
4) Subspace Methods. They look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods can thus ignore irrelevant attributes [13].

In this study, we focus on partition clustering methods. K-means clustering [14] and fuzzy c-means clustering (FCM) [15] are two typical algorithms of this type. Both are iterative: the solution obtained depends on the selection of the initial partition, and the algorithm may converge to a local minimum of the criterion function if the initial partition is not properly chosen [16]. A branch-and-bound algorithm was proposed to find the globally optimal clustering; however, it takes too much computation time [17].

In the last decade, evolutionary algorithms have been applied to the clustering problem, since they are not sensitive to initial values and are able to escape local minima. For example, Elcio Sabato de Abreu e Silva et al.
[18] proposed the application of a genetic algorithm (GA) for determining global minima to be used as seeds for a higher-level ab initio analysis such as density functional theory (DFT). Water clusters were used as a test case, and four empirical potentials (TIP3P, TIP4P, TIP5P and ST2) were considered as initial guesses for the GA calculations. Two types of analysis were performed for the corresponding structures and energies, namely with rigid (DFT_RM) and non-rigid (DFT_NRM) molecules. For the DFT analysis, the PBE exchange-correlation functional and the large basis set A-PVTZ were used. All structures and their respective energies calculated through the GA method, DFT_RM and DFT_NRM were compared and discussed. The proposed methodology proved very efficient for obtaining quasi-accurate global minima at the level of ab initio calculations, and the data were discussed in the light of previously published results, with particular attention to (H2O)n clusters. Lin et al. [19] pointed out that k-anonymity has been widely adopted as a model for protecting publicly released microdata from individual identification. Their work