J.F. Martínez-Trinidad et al. (Eds.): CIARP 2006, LNCS 4225, pp. 491–501, 2006. © Springer-Verlag Berlin Heidelberg 2006

Conceptual K-Means Algorithm Based on Complex Features

I.O. Ayaquica-Martínez, J.Fco. Martínez-Trinidad, and J. Ariel Carrasco-Ochoa

National Institute of Astrophysics, Optics and Electronics, Computer Science Department
Luis Enrique Erro # 1, Santa María Tonantzintla, Puebla, Mexico, C.P. 72840
{ayaquica, fmartine, ariel}@inaoep.mx

Abstract. The k-means algorithm is the most studied and used tool for solving the clustering problem when the number of clusters is known a priori. Currently, there is only one conceptual version of this algorithm: the conceptual k-means algorithm. One characteristic of this algorithm is its use of generalization lattices, which define relationships among the feature values. However, for many applications it is difficult to determine the best generalization lattices; moreover, there are no automatic methods for building the lattices, so this task must be done by a specialist in the area of the problem to be solved. In addition, this algorithm does not work with missing data. For these reasons, this paper proposes a new conceptual k-means algorithm that does not use generalization lattices to build the concepts and that can work with missing data. We use complex features to generate the concepts. Complex features are subsets of features with associated values that characterize the objects of a cluster and, at the same time, do not characterize objects from other clusters. Some experimental results obtained by our algorithm are shown and compared against the results obtained by the conceptual k-means algorithm.

1 Introduction

The conceptual clustering problem was first addressed in the 1980s by Michalski [1]. It consists of finding, from a data set, not only the clusters but also a conceptual interpretation of them.
Building on Michalski’s work, several algorithms have been developed to solve the conceptual clustering problem. Some of them can be found in [2-8].

The k-means algorithm is the most studied and used tool for solving the clustering problem when the number of clusters is known a priori. The conceptual k-means algorithm proposed by Ralambondrainy [8] is the only conceptual version of this algorithm. Therefore, we focus on this algorithm.

The conceptual k-means algorithm [8] was developed to solve problems where the number of clusters is known a priori. This algorithm consists of two phases: an aggregation phase, in which the clusters are built, and a characterization phase, in which the concepts are generated. In the aggregation phase, the k-means algorithm was extended to work with mixed data. In order to solve the mixed data problem, a