Proceedings of the 9th National Scientific Conference "Fundamental and Applied Information Technology Research" (FAIR'9); Cần Thơ, August 4-5, 2016
DOI: 10.15625/vap.2016.0005

AN INFORMATION-THEORETIC METRIC BASED METHOD FOR SELECTING CLUSTERING ATTRIBUTE

Pham Cong Xuyen, Đo Si Truong, Nguyen Thanh Tung

Lac Hong University
pcxuyen@lhu.edu.vn, truongds@lhu.edu.vn, nttung@lhu.edu.vn

ABSTRACT— The clustering problem appears in many different fields such as Data Mining, Pattern Recognition, Bioinformatics, etc. The basic objective of clustering is to group objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. Recently, many researchers have contributed to categorical data clustering, where data objects are described by non-numerical attributes. In particular, rough set theory based attribute-selection clustering approaches for categorical data have attracted much attention. The key to these approaches is how to select, at each step, the single attribute from among many candidates that best clusters the objects. In this paper, we review three rough set based techniques: Total Roughness (TR), Min-Min Roughness (MMR) and Maximum Dependency Attribute (MDA), and propose MAMD (Minimum value of Average Mantaras Distance), an alternative algorithm for hierarchical clustering attribute selection. MAMD uses the Mantaras metric, an information-theoretic metric on the set of partitions of a finite set of objects, and seeks a clustering attribute such that the average distance between the partition generated by this attribute and the partitions generated by the other attributes of the objects has a minimum value. To evaluate and compare MAMD with the three rough set based techniques, we use the concept of average intra-class similarity to measure the clustering quality of the selected attribute.
The experimental results show that the clustering quality of the attribute selected by our method is higher than that of the attributes selected by the TR, MMR and MDA methods.

Keywords— Data Mining, Hierarchical clustering, Categorical data, Rough sets, Clustering attribute selection.

I. INTRODUCTION

During the last two decades, data mining has emerged as a rapidly growing interdisciplinary field which merges databases, statistics, machine learning and related areas in order to extract useful knowledge from data (Han and Kamber, 2006). Clustering is one of the fundamental operations in data mining. It can be defined as follows. Let X = {x1, x2, ..., xn} be the set of objects, where each xi is a d-dimensional vector in the given feature space. The clustering task is to find clusters/groups of objects in such a way that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity [6]. The clustering problem appears in many different domains such as pattern recognition, computer vision, biology, medicine, information retrieval, etc.

At present, there exists a large number of clustering algorithms in the literature. They are divided broadly into hierarchical and non-hierarchical methods. Non-hierarchical clustering methods create a single partition of the dataset by optimizing a criterion function, whereas hierarchical clustering methods create a sequence of nested partitions of the dataset. Most of the earlier work on clustering focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, data mining applications frequently involve datasets that also contain categorical attributes, on which distance functions are not naturally defined. Recently, clustering categorical data has attracted much attention from the data mining research community [1, 4, 7, 8, 11, 12, 14].
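The attribute-selection idea described in the abstract can be made concrete. The sketch below is an illustrative reconstruction, not the authors' implementation: the dict-of-lists table layout and all function names are our own, and we use one common normalized form of the Mantaras distance, d(PA, PB) = (H(PA|PB) + H(PB|PA)) / H(PA, PB) = 2 - (H(PA) + H(PB)) / H(PA, PB), which lies in [0, 1]. It then selects the attribute whose average distance to the partitions induced by the other attributes is minimal, in the spirit of MAMD:

```python
from collections import Counter
from math import log2

def entropies(col_a, col_b):
    """Joint and marginal entropies of the partitions induced by two
    categorical attributes (columns) over the same set of objects."""
    n = len(col_a)
    h = lambda counts: -sum(c / n * log2(c / n) for c in counts.values())
    return h(Counter(zip(col_a, col_b))), h(Counter(col_a)), h(Counter(col_b))

def mantaras_distance(col_a, col_b):
    """Normalized Mantaras distance between the partitions induced by two
    columns: 2 - (H(PA) + H(PB)) / H(PA, PB), which equals
    (H(PA|PB) + H(PB|PA)) / H(PA, PB) and lies in [0, 1]."""
    h_joint, h_a, h_b = entropies(col_a, col_b)
    if h_joint == 0:  # both partitions trivial (one block): identical
        return 0.0
    return 2.0 - (h_a + h_b) / h_joint

def select_clustering_attribute(table):
    """MAMD-style selection sketch: return the attribute whose average
    Mantaras distance to all other attributes' partitions is minimal."""
    attrs = list(table)
    best, best_avg = None, float('inf')
    for a in attrs:
        dists = [mantaras_distance(table[a], table[b]) for b in attrs if b != a]
        avg = sum(dists) / len(dists)
        if avg < best_avg:
            best, best_avg = a, avg
    return best, best_avg
```

Two sanity checks on the metric: attributes that induce identical partitions are at distance 0, and attributes whose partitions are statistically independent (uniform joint distribution over all block pairs) are at the maximum distance 1.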
One technique for categorical data clustering works by introducing a series of clustering attributes: at each step, one attribute is selected and used to divide the objects, until all objects are clustered. This raises a practical problem: from many candidate attributes, we need to select at each step the single attribute that best clusters the objects according to some predefined criterion. Recently, there have been works applying rough set theory to handle uncertainty in the process of selecting clustering attributes [7, 9, 11, 12]. Mazlack et al. [11] proposed a technique using the average of the accuracy of approximation in rough set theory, called total roughness (TR), where the higher the total roughness, the higher the accuracy of the selected clustering attribute. Parmar et al. [12] proposed the MMR (Min-Min Roughness) algorithm, a "purity" rough set-based hierarchical clustering algorithm for categorical data. The MMR algorithm determines the clustering attribute by the MR (Min-Roughness) criterion. However, as Herawan et al. proved in [7], MMR is complementary to TR, and with this technique complexity remains an issue, since all attributes must be considered to obtain the clustering attribute. In order to solve these problems, Herawan et al. [7] proposed a new technique called maximum dependency attributes (MDA), which is based on rough set theory by taking into