Abstract: Clustering is a well-known data mining technique used in pattern recognition and information retrieval. The dataset to be clustered can contain either categorical or numeric data, and each type of data has its own specific clustering algorithm. In this context, two algorithms have been proposed: k-means for clustering numeric datasets and k-modes for categorical datasets. A main problem encountered in data mining applications is clustering categorical data, which are prevalent in such datasets. One way to cluster categorical values is to transform the categorical attributes into numeric measures and apply the k-means algorithm directly instead of k-modes. This paper experiments with such an approach, transforming the categorical values into numeric ones using the relative frequency of each modality in the attributes. The proposed approach is compared with a previous method based on transforming the categorical datasets into binary values. The scalability and accuracy of the two methods are evaluated experimentally. The obtained results show that the proposed method outperforms the binary method in all cases.

Keywords: Clustering, k-means, categorical datasets, pattern recognition, unsupervised learning, knowledge discovery.

I. INTRODUCTION

The considerable increase in the manufacturing of information technology devices and the advances in scientific data collection methods have led to the creation of growing data repositories. Traditional exploratory methods have proved inefficient at handling such quantities of data to discover new findings. Thus, recently developed knowledge-discovery systems should implement innovative and appropriate machine learning algorithms to explore these huge structures and to identify initially hidden patterns [1], [2].
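The two transformations contrasted in the abstract can be illustrated with a minimal sketch. It assumes that the relative frequency of a modality is its number of occurrences in an attribute divided by the number of rows, and that the binary method creates one 0/1 indicator per (attribute, modality) pair. The function names `relative_frequency_encode` and `binary_encode` and the toy rows are hypothetical and are not taken from the paper.

```python
from collections import Counter


def relative_frequency_encode(rows):
    """Replace each categorical value by its relative frequency in its column.

    Assumption: frequency = occurrences of the value in the attribute / number of rows.
    """
    n = len(rows)
    columns = list(zip(*rows))                 # one tuple per attribute
    counts = [Counter(col) for col in columns]
    return [[counts[j][value] / n for j, value in enumerate(row)] for row in rows]


def binary_encode(rows):
    """Create one 0/1 indicator per (attribute, modality) pair."""
    columns = list(zip(*rows))
    modalities = [sorted(set(col)) for col in columns]  # fixed order per attribute
    return [
        [1 if value == m else 0 for j, value in enumerate(row) for m in modalities[j]]
        for row in rows
    ]


# Hypothetical toy dataset: two categorical attributes, four individuals.
rows = [
    ("red", "small"),
    ("red", "large"),
    ("blue", "small"),
    ("red", "small"),
]

freq = relative_frequency_encode(rows)   # one numeric column per attribute
binary = binary_encode(rows)             # one column per distinct modality

print(freq)
print(binary)
```

In this sketch the frequency encoding keeps one numeric column per attribute, whereas the binary encoding grows with the number of distinct modalities. Either output could then be passed to any k-means implementation. Note that, under the assumption above, two different modalities with the same frequency in an attribute receive the same numeric code.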
In data mining, clustering [3] is the most commonly encountered knowledge-discovery technique applied in information retrieval and pattern recognition. It refers to unsupervised learning that aims to partition a dataset composed of N individuals embedded in a d-dimensional space into K distinct clusters without any prior knowledge about the distribution of the resulting clusters. The data points in the same cluster are more similar to each other than to data points in other clusters. Three sub-problems are addressed by this process: (i) the similarity measure (distance) used to compare the data points, (ii) the iterative process by which the algorithm discovers the clusters efficiently in an unsupervised way, and (iii) the derivation of a meaningful description for each obtained cluster to extract the corresponding properties and knowledge.

Semeh Ben Salem, Sami Naouali and Moetez Sallami are with the Virtual Reality and Information Technology (VRIT), Military Academy of Fandouk Jedid, Tunisia (e-mail: semeh.bensalem@yahoo.fr, snaouali@gmail.com, Sellami-Moetez@outlook.fr).

k-means is a well-known clustering algorithm proposed for numeric datasets (containing numeric values), which makes it unsuited to clustering categorical datasets. This is a significant restriction that limits the usefulness of the algorithm since, in many data mining applications, most datasets considered may contain categorical values. To deal with categorical datasets, k-means was extended to obtain the k-modes algorithm, which is detailed in the next section. However, another option worth exploring is to convert the categorical data into numeric values and apply the k-means algorithm directly.

This paper is organized as follows: in the second section, we present previous approaches to clustering categorical data with their limits and provide a detailed description of the k-means algorithm adopted in this study.
In the third section, our proposed approach is detailed. Experimental results and discussion are provided in the fourth section, and the last section is devoted to the conclusion and perspectives.

II. LITERATURE REVIEW IN CLUSTERING CATEGORICAL DATASETS

A. Categorical Clustering Algorithms

Although several proposals have been made for clustering categorical datasets, the most popular algorithm is k-modes [4] and its variants [5]-[7]. It is an extension of the k-means algorithm in which the Euclidean distance is replaced by the simple matching dissimilarity function, which is more suitable for categorical values, and the means are replaced by the modes to identify the most representative element of a cluster (centroid). The modes are updated in each iteration using a frequency-based method. The k-prototype algorithm [4] permits clustering mixed datasets with categorical and numeric values. Numerous variants have also been proposed, such as the fuzzy k-modes algorithm [8] and the fuzzy k-modes algorithm with fuzzy centroids [9]. However, the main limitation of the simple matching dissimilarity is that it often results in clusters with weak intra-similarity [10]. In [11], the authors showed that the similarity between two categorical values can also be defined through their co-occurrence with a common value or a set of values. This gives a second family of techniques for clustering categorical data, based on the co-occurrence of the attributes. The most popular algorithm in this category is ROCK [12].
Semeh Ben Salem, Sami Naouali, Moetez Sallami, "Clustering Categorical Data Using the K-Means Algorithm and the Attribute’s Relative Frequency," World Academy of Science, Engineering and Technology, International Journal of Computer and Systems Engineering, Vol:11, No:6, 2017, p. 709, waset.org/Publication/10007221

It measures the similarity between the categorical