Using the Group Genetic Algorithm for Attribute Clustering

Tzung-Pei Hong
Department of Science and Information Engineering, National University of Kaohsiung
Kaohsiung, 811, Taiwan, R.O.C.
tphong@nuk.edu.tw

Feng-Shih Lin
Department of Computer Science and Information Engineering, National Sun Yat-sen University
Kaohsiung, 804, Taiwan, R.O.C.
m983040076@student.nsysu.edu.tw

Chun-Hao Chen
Department of Computer Science and Information Engineering, Tamkang University
Taipei, Taiwan, R.O.C.
chchen@mail.tku.edu.tw

Abstract—Feature selection by attribute clustering has been proposed in earlier work, and Hong et al. proposed several genetic algorithms for finding appropriate attribute clusters. In this paper, we attempt to improve the performance of the GA-based attribute-clustering process by using the grouping genetic algorithm (GGA). In our approach, the general GGA representation and operators are used to reduce the redundancy of the chromosome representation for attribute clustering. Finally, experiments are conducted to compare the efficiency of the proposed approaches with that of the previous ones.

Keywords-attribute clustering, feature selection, genetic algorithm, grouping genetic algorithm, data mining.

I. INTRODUCTION

In data mining and machine learning, feature selection is an important pre-processing step [5][8]. A proper subset of features can not only reduce the execution time of deriving rules [2], but also improve classification accuracy. To overcome the curse of dimensionality, several feature selection techniques have been proposed [1][17]. However, finding an optimal feature subset has been shown to be an NP-hard problem [3]. In 2007, Hong and Liou proposed a feature selection approach based on the concept of feature clustering [10]. Based on the same idea, Hong and Wang [13][14] then proposed GA-based methods for attribute clustering to find approximate feature subsets for classification.
However, as Falkenauer pointed out, the general GA has some weaknesses when solving grouping problems [6]. Because of the encoding scheme and the combinatorial nature of grouping, multiple chromosomes map to the same attribute clustering result (feasible solution), which makes the search space larger than needed. Falkenauer [6] therefore proposed the grouping genetic algorithm (GGA) as a new evolutionary algorithm to ease the problem. GGA has the same workflow as GA, but uses a different encoding scheme and different operators. It has been shown that GGA is more efficient than GA in some areas, especially on grouping problems [4]. In this paper, we thus propose a GGA-based attribute clustering approach.

II. REVIEW OF RELATED WORK

The purpose of feature selection is to find a proper subset of features that are relevant to the target concept. The dependency measure, proposed by Han et al. [9] and Li et al. [11], was used to estimate the similarity between attributes. Hong and Liou used the dependency measure in their attribute-clustering-based feature selection approaches [10]. Attributes that contribute similarly to the classification have high dependency on each other.

Hong and Wang [14] proposed a GA-based clustering method for attribute clustering to find approximate feature subsets for classification. They first proposed an approach in which the fitness of a chromosome, which represents a set of attribute clusters, is evaluated by the average classification accuracy and the cluster balance. The adopted fitness function achieves a good trade-off between accuracy and cluster balance.

Many methods that use GAs to solve grouping problems have been proposed [12]. Some problems, however, arise when the standard GA is applied to grouping problems. Two main weaknesses are described below. First, the standard encoding scheme of GAs is highly redundant on grouping problems.
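The redundancy can be illustrated with a small sketch. It assumes a label-per-attribute encoding (gene i holds the cluster label of attribute i), which is a common standard GA encoding for grouping problems but is not spelled out in the text here; the helper names `canonical` and `count_redundancy` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def canonical(chromosome):
    """Relabel clusters in order of first appearance, so that label
    strings describing the same grouping collapse to the same tuple."""
    relabel = {}
    out = []
    for label in chromosome:
        if label not in relabel:
            relabel[label] = len(relabel)
        out.append(relabel[label])
    return tuple(out)


def count_redundancy(n_attributes, n_labels):
    """Return (number of chromosomes, number of distinct groupings)
    under the assumed label-per-attribute encoding."""
    chromosomes = list(product(range(n_labels), repeat=n_attributes))
    groupings = {canonical(c) for c in chromosomes}
    return len(chromosomes), len(groupings)


if __name__ == "__main__":
    # Two different chromosomes that encode the same attribute clustering:
    # attributes {0, 1} in one cluster and {2, 3} in the other.
    a = (0, 0, 1, 1)
    b = (1, 1, 0, 0)
    print(canonical(a) == canonical(b))  # True

    total, distinct = count_redundancy(4, 2)
    print(total, distinct)  # 16 8
```

With four attributes and two cluster labels, the 16 possible chromosomes describe only 8 distinct groupings, so half of the search space is duplicated; the duplication grows quickly as the number of labels increases.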
The second weakness is that the classical GA crossover operator cannot ensure that offspring inherit the properties of their parents.

Since the traditional GA approach has these weaknesses when applied to grouping problems, Falkenauer proposed the grouping genetic algorithm (GGA) to overcome them [7]. Pankratz employed an adaptation of GGA for the vehicle routing problem [15], and Rekiek applied GGA to the handicapped person transportation problem [16]. Falkenauer's experimental results showed that GGA did better than GA on these problems [6]. Brown and Sumichrast [3] also empirically tested the performance of GA and GGA in different domains. Their results also indicated that GGA was superior to GA for large grouping problems.

GGA and GA follow nearly the same procedure, but GGA adopts a different encoding scheme and different genetic operators. In the following paragraph, we briefly introduce GGA's encoding scheme and genetic operators.

WCCI 2012 IEEE World Congress on Computational Intelligence, June 10-15, 2012, Brisbane, Australia (IEEE CEC)

In Falkenauer's GGA representation, a chromosome consists of two parts, an object part and a group part. The object part stores the information about how the objects are grouped, and the group part is an ordered list of the groups. The object part is formed by a fixed-length string, with each gene in