Applied Soft Computing 30 (2015) 113–122
Journal homepage: www.elsevier.com/locate/asoc
Non-dominated sorting genetic algorithm using fuzzy membership
chromosome for categorical data clustering
Chao-Lung Yang∗, R.J. Kuo, Chia-Hsuan Chien, Nguyen Thi Phuong Quyen
Department of Industrial Management, National Taiwan University of Science and Technology, Taipei, Taiwan, ROC
Article info
Article history:
Received 6 May 2014
Received in revised form 12 November 2014
Accepted 9 January 2015
Available online 31 January 2015
Keywords:
Categorical attributes
Multi-objective optimization
Genetic algorithm
Fuzzy clustering
Abstract
This research proposes a data clustering algorithm, the non-dominated sorting genetic algorithm-fuzzy
membership chromosome (NSGA-FMC), to improve clustering quality on categorical data. Based on the
K-modes method, NSGA-FMC combines a fuzzy genetic algorithm with multi-objective optimization and uses
fuzzy membership values as the chromosome. Owing to this innovative chromosome setting, the algorithm
integrates a more efficient solution selection technique, which selects a solution from the non-dominated
Pareto front based on the largest fuzzy membership. Two objective functions, fuzzy compactness within a
cluster (π) and separation among clusters (sep), are used to optimize clustering quality. A series of
experiments on three UCI categorical datasets compared the clustering results of the proposed NSGA-FMC
with two existing methods: genetic algorithm fuzzy K-modes (GA-FKM) and multi-objective genetic
algorithm-based fuzzy clustering of categorical attributes (MOGA (π, sep)). Adjusted Rand index (ARI), π,
sep, and computation time were used as performance indexes for comparison. The experimental results showed
that the proposed method obtains better clustering quality in terms of ARI, π, and sep simultaneously,
with shorter computation time.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
A clustering procedure partitions a given dataset into several subsets based on a similarity or
dissimilarity measure. For numerical data, a standard distance measure such as Euclidean distance is
used to calculate the distance between two points of the dataset in the clustering algorithm. However,
categorical values have no natural order or distance that can be directly applied to a categorical
dataset. Categorical attributes such as gender and blood type, which can be identified as ordinal or
non-ordinal, are very common in real-world datasets. Each categorical attribute is represented by a
small set of unique categorical values, such as [A, B, AB, O] for the blood type attribute. Because
categorical data are discrete and unordered, a new clustering algorithm is needed to accommodate the
dissimilarity measurement of categorical data.
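To make the point about the missing order concrete, the following minimal Python sketch compares two ways of measuring distance between blood types. The integer coding in `codes` and the helper names `coded_distance` and `matching_distance` are assumptions introduced for illustration only; they are not taken from the paper.

```python
import itertools

blood_types = ["A", "B", "AB", "O"]

# Hypothetical, arbitrary integer coding: A=0, B=1, AB=2, O=3.
codes = {value: index for index, value in enumerate(blood_types)}


def coded_distance(x, y):
    """Absolute difference of the arbitrary integer codes."""
    return abs(codes[x] - codes[y])


def matching_distance(x, y):
    """Simple matching: 0 if the values are equal, 1 otherwise."""
    return 0 if x == y else 1


# Print both distances for every pair of distinct blood types.
for x, y in itertools.combinations(blood_types, 2):
    print(x, y, coded_distance(x, y), matching_distance(x, y))
```

Under the arbitrary coding, A and O appear three times as far apart as A and B, although nothing in the data supports that ranking. The simple matching measure treats every pair of distinct values as equally dissimilar.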
∗ Corresponding author. Tel.: +886 227303621; fax: +886 227376344.
E-mail addresses: clyang@mail.ntust.edu.tw (C.-L. Yang), rjkuo@mail.ntust.edu.tw (R.J. Kuo), lucky6844@gmail.com (C.-H. Chien), quyen.ntp@gmail.com (N.T.P. Quyen).
Several methods have been proposed to handle dissimilarity measurement on categorical data. One way
is to convert the categorical data to numerical data and calculate the dissimilarity with an existing
dissimilarity method. However, if the data are nominal with no ordering, the assigned numerical values
might bias or mislead the clustering result [1]. Another approach is to count value occurrences
(frequency-based) to calculate the dissimilarity. For instance, the K-modes algorithm, which is
modified from the K-means algorithm [2–4], uses modes instead of means as the centroids of clusters [5].
Because frequency-based dissimilarity can be adapted to all kinds of categorical data without this
limitation, the K-modes clustering method is used in this research to study categorical datasets.
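The frequency-based idea behind K-modes can be sketched as follows. This is a minimal illustration, assuming the simple matching dissimilarity (the number of mismatched attributes) and a mode computed attribute by attribute. The function names and the toy records are hypothetical, and the sketch omits the fuzzy memberships and the iterative reassignment that a full clustering algorithm would need.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Return the attribute-wise most frequent value of a cluster."""
    return tuple(
        Counter(column).most_common(1)[0][0] for column in zip(*records)
    )


# Toy cluster of records: (blood type, gender, smoker). Illustrative data only.
cluster = [
    ("A", "female", "yes"),
    ("A", "male", "yes"),
    ("O", "female", "yes"),
]

mode = cluster_mode(cluster)
distances = [matching_dissimilarity(record, mode) for record in cluster]
print(mode, distances)
```

Here the mode plays the role that the mean plays in K-means: it is the representative against which each record's dissimilarity is counted, and no numerical coding of the categories is required.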
For either continuous or categorical data clustering, most clustering algorithms rely on optimizing a
single objective function, such as the intra-cluster distance, to obtain the data partition. For
example, genetic algorithm (GA) based clustering methods, which follow the rule of Darwinian evolution,
generally use a single objective function to search for a better data partitioning of a dataset. A
clustering result based on a single objective function might be good from only one perspective (lower
total intra-distance in a cluster) and fail to fulfill other clustering objectives, such as enlarging
the separation among clusters. Note that the ideal clustering result might be the data partitioning
where data points
http://dx.doi.org/10.1016/j.asoc.2015.01.031
1568-4946/© 2015 Elsevier B.V. All rights reserved.