Applied Soft Computing 30 (2015) 113–122

Journal homepage: www.elsevier.com/locate/asoc

Non-dominated sorting genetic algorithm using fuzzy membership chromosome for categorical data clustering

Chao-Lung Yang ∗, R.J. Kuo, Chia-Hsuan Chien, Nguyen Thi Phuong Quyen

Department of Industrial Management, National Taiwan University of Science and Technology, Taipei, Taiwan, ROC

Article info

Article history: Received 6 May 2014; Received in revised form 12 November 2014; Accepted 9 January 2015; Available online 31 January 2015

Keywords: Categorical attributes; Multi-objective optimization; Genetic algorithm; Fuzzy clustering

Abstract

In this research, a data clustering algorithm named non-dominated sorting genetic algorithm-fuzzy membership chromosome (NSGA-FMC) is proposed to improve the clustering quality on categorical data. It is based on the K-modes method and combines a fuzzy genetic algorithm with multi-objective optimization. The proposed method uses fuzzy membership values as the chromosome. Owing to this innovative chromosome setting, a more efficient solution selection technique, which selects a solution from the non-dominated Pareto front based on the largest fuzzy membership, is integrated into the proposed algorithm. Two objective functions, fuzzy compactness within a cluster (π) and separation among clusters (sep), are used to optimize the clustering quality. A series of experiments using three UCI categorical datasets was conducted to compare the clustering results of the proposed NSGA-FMC with two existing methods: genetic algorithm fuzzy K-modes (GA-FKM) and multi-objective genetic algorithm-based fuzzy clustering of categorical attributes (MOGA (π, sep)). Adjusted Rand index (ARI), π, sep, and computation time were used as performance indexes for comparison.
The experimental results showed that the proposed method can obtain better clustering quality in terms of ARI, π, and sep simultaneously, with shorter computation time.

© 2015 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +886 227303621; fax: +886 227376344.
E-mail addresses: clyang@mail.ntust.edu.tw (C.-L. Yang), rjkuo@mail.ntust.edu.tw (R.J. Kuo), lucky6844@gmail.com (C.-H. Chien), quyen.ntp@gmail.com (N.T.P. Quyen).

1. Introduction

A clustering procedure partitions a given dataset into several subsets based on a similarity or dissimilarity measure. In a clustering algorithm, a standard distance measure such as the Euclidean distance is used to calculate the distance between two points of the given dataset. However, categorical values have no natural order or distance that can be applied directly to a categorical dataset. Categorical attributes such as gender and blood type, which can be ordinal or non-ordinal, are very common in real-world datasets. Each categorical attribute is represented by a small set of unique categorical values, such as [A, B, AB and O] for the blood type attribute. Because categorical data are discrete and unordered, a new clustering algorithm is needed to accommodate the dissimilarity measurement of categorical data.

Several methods have been proposed to handle dissimilarity measurement on categorical data. One way is to convert the categorical data to numerical data and calculate the dissimilarity with an existing dissimilarity method. However, if the data are nominal with no ordering, the assigned numerical values might bias or mislead the clustering result [1]. Another approach is to count value occurrences (frequency-based) to calculate the dissimilarity. For instance, the K-modes algorithm, which is modified from the K-means algorithm [2–4], uses modes instead of means as the centroids of clusters [5].
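The frequency-based idea behind K-modes can be illustrated with a minimal sketch. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and an attribute-wise mode as the cluster center. The function names and the toy records are hypothetical and serve only as an illustration; this is not the implementation used in this paper.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return the attribute-wise most frequent value of a non-empty cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical toy records: (gender, blood type)
records = [("F", "A"), ("F", "O"), ("M", "A"), ("F", "A")]

# The mode plays the role that the mean plays in K-means.
mode = cluster_mode(records)

# Dissimilarity of each record to the cluster mode.
distances = [matching_dissimilarity(r, mode) for r in records]
```

No numerical encoding of the categories is needed here, so no artificial ordering is imposed on values such as blood types.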
Because the frequency-based dissimilarity can be adapted to all kinds of categorical data without such a limitation, the K-modes clustering method is used in this research to study categorical datasets.

http://dx.doi.org/10.1016/j.asoc.2015.01.031
1568-4946/© 2015 Elsevier B.V. All rights reserved.

For either continuous or categorical data clustering, most clustering algorithms rely on optimizing a single objective function, such as the intra-distance within a cluster, to obtain the data partition. For example, a genetic algorithm (GA) based clustering method, which is based on the rule of Darwinian evolution, generally uses a single objective function to search for a better data partitioning of a dataset. A clustering result based on a single objective function might be good from only one perspective (lower total intra-distance in a cluster) but fail to fulfill other clustering objectives, such as enlarging the separation among clusters. Please note the ideal clustering result might be the data partitioning where data points