On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm

Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, and Zengyou He

Abstract—This correspondence describes extensions to the k-modes algorithm for clustering categorical data. By modifying a simple matching dissimilarity measure for categorical objects, a heuristic approach was developed in [4], [12] that allows the k-modes paradigm to obtain clusters with strong intrasimilarity and to efficiently cluster large categorical data sets. The main aim of this paper is to rigorously derive the updating formula of the k-modes clustering algorithm with the new dissimilarity measure, and to prove the convergence of the algorithm under the optimization framework.

Index Terms—Data mining, clustering, k-modes algorithm, categorical data.

1 INTRODUCTION

Since it was first published in 1997, the k-modes algorithm [5], [6] has become a popular technique for solving categorical data clustering problems in different application domains (e.g., [1], [11]). The k-modes algorithm extends the k-means algorithm [9] by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to update modes in the clustering process to minimize the clustering cost function. These extensions remove the numeric-only limitation of the k-means algorithm and enable the k-means clustering process to be used to efficiently cluster large categorical data sets from real-world databases. An equivalent nonparametric approach to deriving clusters from categorical data is presented in [2]; a note in [8] discusses the equivalence of the two independently developed k-modes approaches. The distance between two objects computed with the simple matching dissimilarity measure is either 0 or 1, which often results in clusters with weak intrasimilarity. Recently, He et al. [4] and San et al.
[12] independently introduced a new dissimilarity measure into the k-modes clustering process to improve the accuracy of the clustering results. Their main idea is to use the relative attribute frequencies of the cluster modes in the dissimilarity measure of the k-modes objective function. This modification allows the algorithm to recognize a cluster with weak intrasimilarity and, therefore, to assign less similar objects to such a cluster, so that the generated clusters have strong intrasimilarity. Experimental results in [4] and [12] have shown that the modified k-modes algorithm is very effective.

The aim of this paper is to give a rigorous proof that the object cluster membership assignment method and the mode updating formulae under the new dissimilarity measure indeed minimize the objective function. We also prove that, with the new dissimilarity measure, the convergence of the clustering process is guaranteed. In [4] and [12], the new dissimilarity measure was introduced heuristically; with the formal proofs, we ensure that the modified k-modes algorithm can be used safely.

The outline of this paper is as follows: In Section 2, we review the k-modes algorithm. In Section 3, we study and analyze the k-modes algorithm with the new dissimilarity measure. In Section 4, examples are given to illustrate the effectiveness of the k-modes algorithm with the new dissimilarity measure. Finally, concluding remarks are given in Section 5.

2 THE k-MODES ALGORITHM

We assume the set of objects to be clustered is stored in a database table $T$ defined by a set of attributes $A_1, A_2, \ldots, A_m$. Each attribute $A_j$ describes a domain of values, denoted by $\mathrm{DOM}(A_j)$, associated with a defined semantic and a data type. In this paper, we only consider two general data types, numeric and categorical, and assume that other types used in database systems can be mapped to one of these two types.
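To make the contrast concrete, the sketch below compares the simple matching measure with a frequency-weighted variant in the spirit of [4] and [12]: a match with a mode category contributes $1 - f/n_l$ (where $f$ is the frequency of that category in the cluster and $n_l$ the cluster size) rather than 0, so clusters whose mode categories are not dominant look less attractive. The exact formula used in the papers appears in Section 3, which is outside this excerpt; the form below is an illustrative assumption, and both function names are ours.

```python
def simple_matching(x, z):
    """Simple matching dissimilarity: number of mismatched attributes."""
    return sum(1 for xj, zj in zip(x, z) if xj != zj)

def frequency_weighted(x, z, cluster):
    """Illustrative frequency-weighted dissimilarity (in the spirit of
    [4], [12]; see Section 3 of the paper for the exact definition).
    A mismatch still costs 1; a match with mode category z_j costs
    1 - freq(z_j) / |cluster|, penalizing weak intrasimilarity."""
    n_l = len(cluster)
    cost = 0.0
    for j, (xj, zj) in enumerate(zip(x, z)):
        if xj != zj:
            cost += 1.0
        else:
            freq = sum(1 for obj in cluster if obj[j] == zj)
            cost += 1.0 - freq / n_l
    return cost
```

For a cluster whose mode categories all have frequency equal to the cluster size, the two measures coincide; otherwise the weighted measure assigns a strictly positive cost even to an object identical to the mode, reflecting the cluster's weak intrasimilarity.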
The domains of attributes associated with these two types are called numeric and categorical, respectively. A numeric domain consists of real numbers. A domain $\mathrm{DOM}(A_j)$ is defined as categorical if it is finite and unordered, i.e., for any $a, b \in \mathrm{DOM}(A_j)$, either $a = b$ or $a \neq b$; see, for instance, [3]. An object $X$ in $T$ can be logically represented as a conjunction of attribute-value pairs $[A_1 = x_1] \wedge [A_2 = x_2] \wedge \cdots \wedge [A_m = x_m]$, where $x_j \in \mathrm{DOM}(A_j)$ for $1 \le j \le m$. Without ambiguity, we represent $X$ as a vector $[x_1, x_2, \ldots, x_m]$. $X$ is called a categorical object if it has only categorical values. We consider that every object has exactly $m$ attribute values; if the value of attribute $A_j$ is missing, we denote it by a designated missing-value symbol. Let $X = \{X_1, X_2, \ldots, X_n\}$ be a set of $n$ objects. Object $X_i$ is represented as $[x_{i,1}, x_{i,2}, \ldots, x_{i,m}]$. We write $X_i = X_k$ if $x_{i,j} = x_{k,j}$ for $1 \le j \le m$. The relation $X_i = X_k$ does not mean that $X_i$ and $X_k$ are the same object in the real-world database, but rather that the two objects have equal values for attributes $A_1, A_2, \ldots, A_m$.

The k-modes algorithm, introduced and developed in [5], [6], makes the following modifications to the k-means algorithm: 1) using a simple matching dissimilarity measure for categorical objects, 2) replacing the means of clusters with the modes, and 3) using a frequency-based method to find the modes. These modifications remove the numeric-only limitation of the k-means algorithm while maintaining its efficiency in clustering large categorical data sets [6].

Let $X$ and $Y$ be two categorical objects represented by $[x_1, x_2, \ldots, x_m]$ and $[y_1, y_2, \ldots, y_m]$, respectively. The simple matching dissimilarity measure between $X$ and $Y$ is defined as follows:

$$
d(X, Y) = \sum_{j=1}^{m} \delta(x_j, y_j), \qquad
\delta(x_j, y_j) =
\begin{cases}
0, & x_j = y_j, \\
1, & x_j \neq y_j.
\end{cases}
\tag{1}
$$

It is easy to verify that the function $d$ defines a metric space on the set of categorical objects.
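The measure (1) simply counts the attributes on which two objects differ, and the metric properties (identity, symmetry, triangle inequality) can be checked directly on small examples. A minimal sketch, with hypothetical three-attribute objects of our own choosing:

```python
def d(X, Y):
    """Simple matching dissimilarity, (1): the number of attributes on
    which the two categorical objects differ (a generalized Hamming
    distance over categorical values)."""
    assert len(X) == len(Y), "objects must have the same m attributes"
    return sum(1 for xj, yj in zip(X, Y) if xj != yj)

# Hypothetical categorical objects over attributes (color, size, shape).
X = ("red", "small", "round")
Y = ("red", "large", "round")
Z = ("blue", "large", "square")

# Metric properties on this triple:
assert d(X, X) == 0                    # identity
assert d(X, Y) == d(Y, X)              # symmetry
assert d(X, Z) <= d(X, Y) + d(Y, Z)    # triangle inequality
```

The triangle inequality holds attribute-wise: if $x_j \neq z_j$, then $x_j$ and $z_j$ cannot both equal $y_j$, so each unit of $\delta(x_j, z_j)$ is covered by $\delta(x_j, y_j) + \delta(y_j, z_j)$.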
Traditionally, the simple matching approach is often applied to binary variables converted from categorical variables [10, pp. 28-29]. We note that $d$ is also a kind of generalized Hamming distance.

The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective of clustering a set of $n$ categorical objects into $k$ clusters is to find $W$ and $Z$ that minimize

$$
F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{l,i} \, d(Z_l, X_i),
\tag{2}
$$

subject to $w_{l,i} \in \{0, 1\}$ and $\sum_{l=1}^{k} w_{l,i} = 1$ for $1 \le i \le n$, where $W = [w_{l,i}]$ is a $k \times n$ partition matrix and $Z = \{Z_1, Z_2, \ldots, Z_k\}$ is a set of cluster modes.

---

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 29, NO. 3, MARCH 2007, p. 503.

- M.K. Ng and M.J. Li are with the Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Hong Kong. E-mail: {mng, jjli}@math.hkbu.edu.hk.
- J.Z. Huang is with the E-Business Technology Institute, The University of Hong Kong, Pokfulam Road, Hong Kong. E-mail: jhuang@eti.hku.hk.
- Z. He is with the Department of Computer Science and Engineering, Harbin Institute of Technology, 92 West Dazhi Street, PO Box 315, Harbin 150001, China. E-mail: zengyouhe@yahoo.com.

Manuscript received 7 Jan. 2006; revised 13 June 2006; accepted 31 July 2006; published online 15 Jan. 2007. Recommended for acceptance by M. Figueiredo. For information on obtaining reprints of this article, please send e-mail to tpami@computer.org, and reference IEEECS Log Number TPAMI-0010-0106. 0162-8828/07/$25.00 © 2007 IEEE. Published by the IEEE Computer Society.
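The k-means-style iteration that minimizes the clustering cost under the simple matching measure alternates between assigning each object to its nearest mode and updating each mode by the frequency-based method of Section 2 (per attribute, the most frequent category in the cluster). The sketch below is our own minimal illustration of one such iteration, not code from the paper; the helper names are hypothetical.

```python
from collections import Counter

def simple_matching(x, z):
    """Simple matching dissimilarity (1): count of mismatched attributes."""
    return sum(1 for xj, zj in zip(x, z) if xj != zj)

def cost(clusters, modes):
    """Total dissimilarity of objects to the modes of their clusters."""
    return sum(simple_matching(x, z)
               for c, z in zip(clusters, modes) for x in c)

def kmodes_step(objects, modes):
    """One assign/update iteration of a k-modes-style loop (illustrative)."""
    k = len(modes)
    clusters = [[] for _ in range(k)]
    for x in objects:
        # Assign each object to the nearest mode (ties broken by index).
        l = min(range(k), key=lambda i: simple_matching(x, modes[i]))
        clusters[l].append(x)
    new_modes = []
    for c, z in zip(clusters, modes):
        if not c:
            new_modes.append(z)  # keep the old mode for an empty cluster
            continue
        # Frequency-based update: per attribute, the most frequent category.
        new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                               for col in zip(*c)))
    return clusters, new_modes
```

Each assignment step cannot increase the cost for fixed modes, and each frequency-based mode update cannot increase it for fixed assignments; the paper's contribution is proving the analogous properties, and hence convergence, under the new frequency-weighted dissimilarity measure.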