Improved Crisp and Fuzzy Clustering Techniques for Categorical Data

Indrajit Saha and Anirban Mukhopadhyay

Abstract—Clustering is a widely used technique in data mining applications for discovering patterns in the underlying data. Most traditional clustering algorithms are limited in their ability to handle datasets that contain categorical attributes. However, datasets with categorical attributes are common in real-life data mining problems. For these data sets, no inherent distance measure, like the Euclidean distance, is available to compute the distance between two categorical objects. In this article, we describe two algorithms, based on genetic algorithms and simulated annealing, in both the crisp and fuzzy domains. The performance of the proposed algorithms is compared with that of several well-known categorical data clustering algorithms, crisp and fuzzy, on a variety of artificial and real-life categorical data sets. Statistical significance tests have also been performed to establish the superiority of the proposed algorithms.

Keywords: Genetic Algorithm based Clustering, Simulated Annealing based Clustering, K-medoids Algorithm, Fuzzy C-Medoids Algorithm, Cluster Validity Indices, Statistical Significance Test.

1 Introduction

Genetic algorithms (GAs) [1, 2, 3] are randomized search and optimization techniques guided by the principles of evolution and natural genetics, and they exhibit a large amount of implicit parallelism. GAs perform search in complex, large, and multimodal landscapes, and provide near-optimal solutions for the objective (fitness) function of an optimization problem. The algorithm starts by initializing a population of potential solutions encoded into strings called chromosomes. Each solution has a fitness value, based on which the fittest parents to be used for reproduction are selected (survival of the fittest).
The new generation is created by applying genetic operators such as crossover (exchange of information between parents) and mutation (a sudden small change in a parent) to the selected parents. Thus the quality of the population improves as the number of generations increases. The process continues until some specified criterion is met or the solution converges to an optimized value.

* Date of the manuscript submission: 13th April 2008. Academy of Technology, Department of Information Technology, Adisaptagram-712121, West Bengal, India. Email: indra raju@yahoo.co.in. University of Kalyani, Department of Computer Science and Engineering, Kalyani-741235, West Bengal, India. Email: anirbanbuba@yahoo.com.

Simulated annealing (SA) [4], a popular search algorithm, utilizes principles of statistical mechanics regarding the behaviour of a large number of atoms at low temperature to find minimal-cost solutions to large optimization problems by minimizing an associated energy. In statistical mechanics, investigating the ground states or low-energy states of matter is of fundamental importance; these states are reached at very low temperatures. However, lowering the temperature alone is not sufficient, since doing so results in unstable states. In the annealing process, the temperature is first raised and then decreased gradually to a very low value (Tmin), while ensuring that sufficient time is spent at each temperature value. This process yields stable low-energy states. Geman and Geman [5] provided a proof that SA, if annealed sufficiently slowly, converges to the global optimum. Being based on strong theory, SA has been applied in diverse areas by optimizing a single criterion.

Clustering [6, 7, 8, 9] is a useful unsupervised data mining technique which partitions the input space into K regions depending on some similarity/dissimilarity metric, where the value of K may or may not be known a priori.
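The selection, crossover, and mutation loop described above can be sketched minimally as follows. The bit-string encoding, binary tournament selection, and one-max fitness used here are illustrative assumptions for exposition, not the chromosome representation used by the proposed clustering algorithms.

```python
import random

def genetic_algorithm(fitness, length=20, pop_size=30, generations=100,
                      crossover_rate=0.9, mutation_rate=0.01, seed=0):
    """Minimal generational GA over fixed-length bit strings."""
    rng = random.Random(seed)
    # Initialize a population of random chromosomes (bit strings).
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # Selection: binary tournament (survival of the fittest).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            # Crossover: exchange of information between the two parents.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            else:
                c1, c2 = p1[:], p2[:]
            # Mutation: sudden small change in an offspring.
            for child in (c1, c2):
                for i in range(length):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                new_pop.append(child)
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)

# One-max toy problem: the fitness of a chromosome is its number of 1s.
best = genetic_algorithm(sum)
```

Fitter parents are chosen more often, so the population quality rises over generations, exactly as described above.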
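The annealing schedule just described, gradual cooling with sufficient time spent at each temperature and occasional acceptance of uphill moves, can be sketched as follows. The quadratic energy function and the geometric cooling rate are illustrative assumptions, not the schedule used by the proposed algorithms.

```python
import math
import random

def simulated_annealing(energy, neighbour, x0, t_max=10.0, t_min=1e-3,
                        cooling=0.9, steps_per_temp=50, seed=0):
    """Minimal SA: geometric cooling from t_max down to t_min."""
    rng = random.Random(seed)
    current, best = x0, x0
    t = t_max
    while t > t_min:
        # Spend several steps at each temperature value.
        for _ in range(steps_per_temp):
            candidate = neighbour(current, rng)
            delta = energy(candidate) - energy(current)
            # Always accept downhill moves; accept uphill moves with
            # probability exp(-delta / t) (the Metropolis criterion).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current = candidate
            if energy(current) < energy(best):
                best = current
        t *= cooling  # gradual decrease of the temperature
    return best

# Toy example: minimize the energy E(x) = (x - 3)^2 over the real line.
result = simulated_annealing(lambda x: (x - 3.0) ** 2,
                             lambda x, rng: x + rng.uniform(-1.0, 1.0),
                             x0=0.0)
```

At high temperature the search moves almost freely; as t falls toward t_min, uphill moves become rare and the state settles into a low-energy configuration.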
The main objective of any clustering technique is to produce a K × n partition matrix U(X) of the given data set X, consisting of n patterns, X = {x_1, x_2, ..., x_n}. The partition matrix may be represented as U = [u_kj], k = 1, ..., K and j = 1, ..., n, where u_kj is the membership of pattern x_j to the kth cluster. For fuzzy clustering of the data, 0 < u_kj < 1, i.e., u_kj denotes the degree of belongingness of pattern x_j to the kth cluster. The objective of the Fuzzy C-Means algorithm [10] is to maximize the global compactness of the clusters. The Fuzzy C-Means algorithm cannot, however, be applied to categorical data sets, where there is no natural ordering among the elements of an attribute domain. Thus no inherent distance measure, such as the Euclidean distance, can be used to compute the distance between two feature vectors [11, 12, 13], and hence it is not feasible to compute the numerical average of a set of feature vectors. For such categorical data sets, a well-known relational clustering algorithm is PAM (Partitioning Around Medoids), due to Kaufman and Rousseeuw [14]. This algorithm is based on finding K

IAENG International Journal of Computer Science, 35:4, IJCS_35_4_01
(Advance online publication: 20 November 2008)
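The medoid idea above can be sketched for categorical data using the simple matching (Hamming-style) dissimilarity in place of the unavailable Euclidean distance. The alternating assign/update scheme shown is the common K-medoids heuristic, not PAM's full swap procedure, and the toy data set is an illustrative assumption.

```python
import random

def matching_distance(a, b):
    """Simple matching dissimilarity: number of mismatched attributes."""
    return sum(x != y for x, y in zip(a, b))

def k_medoids(data, k, iterations=10, seed=0):
    """Alternating K-medoids clustering of categorical tuples."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)
    for _ in range(iterations):
        # Assignment step: each object joins its nearest medoid's cluster.
        clusters = [[] for _ in range(k)]
        for obj in data:
            nearest = min(range(k),
                          key=lambda j: matching_distance(obj, medoids[j]))
            clusters[nearest].append(obj)
        # Update step: the new medoid of a cluster is the member with the
        # smallest total dissimilarity to the other members. No averaging
        # is needed, so categorical attributes pose no problem.
        for i, cluster in enumerate(clusters):
            if cluster:
                medoids[i] = min(cluster, key=lambda m: sum(
                    matching_distance(m, o) for o in cluster))
    return medoids, clusters

data = [("red", "small"), ("red", "medium"), ("blue", "large"),
        ("blue", "large"), ("red", "small"), ("blue", "medium")]
medoids, clusters = k_medoids(data, k=2)
```

Because each cluster representative is an actual object of the data set rather than a computed mean, this scheme applies directly where Fuzzy C-Means does not.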