A Comparison of Categorical Attribute Data Clustering Methods

Ville Hautamäki 1, Antti Pöllänen 1, Tomi Kinnunen 1, Kong Aik Lee 2, Haizhou Li 2, and Pasi Fränti 1

1 School of Computing, University of Eastern Finland, Finland
2 Institute for Infocomm Research, A*STAR, Singapore
villeh@cs.uef.fi

Abstract. Clustering data in Euclidean space has a long tradition, and considerable attention has been paid to analyzing several different cost functions. Unfortunately, these results rarely generalize to the clustering of categorical attribute data. Instead, a simple heuristic, k-modes, is the most commonly used method despite its modest performance. In this study, we model clusters by their empirical distributions and use expected entropy as the objective function. A novel clustering algorithm is designed based on local search for this objective function and compared against six existing algorithms on well-known data sets. The proposed method provides better clustering quality than the other iterative methods at the cost of higher time complexity.

1 Introduction

The goal of clustering [1] is to reveal hidden structures in a given data set by grouping similar data objects together while keeping dissimilar data objects in separate groups. Let X denote the set of data objects to be clustered. The classical clustering problem setting considers data objects in a D-dimensional vector space, X ⊂ R^D. The most commonly used objective function for such data is the mean squared error (MSE). A generic solution is the well-known k-means method [2], which consists of two steps that are iterated until convergence: in the assignment step (or E-step), all vectors are assigned to new clusters, and in the re-estimation step (or M-step), the model parameters are updated based on the new assignments.

Different from vector space data, data in the educational sciences, sociology, market studies, biology, and bioinformatics often involves categorical attributes, also known as nominal data.
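The two-step k-means iteration described above can be sketched as follows. This is a minimal illustration for vector data only (function and variable names are ours, not from the paper), showing why the method relies on means and squared distances and hence does not transfer directly to categorical data:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: alternate assignment (E) and re-estimation (M) steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data vectors.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for it in range(n_iter):
        # E-step: assign each vector to its nearest centroid (squared error).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # M-step: re-estimate each centroid as the mean of its assigned vectors.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

The M-step's mean computation is exactly the operation that has no meaningful analogue for unordered categorical codes, which motivates objective functions such as expected entropy.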
For instance, a data object could be a single questionnaire form that consists of multiple-choice questions. Possible outcomes of the answers can be encoded as integers. In this way, each questionnaire would be represented as an element of N^D, where D is the number of questions. Unfortunately, since the categories do not have any natural ordering, clustering methods developed for metric space data cannot be applied as such.

Hamming distance is a distance function designed for categorical data. It counts the number of attributes where two vectors disagree, i.e., having different

P. Fränti et al. (Eds.): S+SSPR 2014, LNCS 8621, pp. 53–62, 2014.
© Springer-Verlag Berlin Heidelberg 2014
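As a concrete illustration of the encoding and the Hamming distance described above (the example data values are invented):

```python
import numpy as np

def hamming(x, y):
    """Hamming distance: the number of attributes where x and y disagree."""
    x, y = np.asarray(x), np.asarray(y)
    return int((x != y).sum())

# Two questionnaire forms with D = 4 multiple-choice questions,
# answers encoded as integers (the coding is arbitrary and illustrative only).
a = [0, 2, 1, 3]
b = [0, 1, 1, 0]
print(hamming(a, b))  # the forms disagree on questions 2 and 4 -> 2
```

Note that only equality of codes is used; the integer values themselves carry no order or magnitude, which is precisely why such a count-based distance remains valid for nominal data.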