Discovery of Classification Rules using Genetic Algorithm with non-random Population initialization

Priyanka Sharma
CSE Department, G. J. U. S. & T., Hisar
e-mail: pinki.sharma2912@gmail.com

Abstract— The goal of a classification technique is to predict the class to which an instance of a dataset belongs. The discovered knowledge is then presented in the form of high-level, easy-to-understand classification rules. Genetic algorithms have been widely adopted for the discovery of classification rules. The main criticism of employing genetic algorithms in data mining applications is that they may converge to local optima and that the algorithm may degenerate into a random walk in its initial runs. One solution to this problem is to give the initial population a filtering bias, so that more significant attributes are initialized with higher probability than less significant ones. This paper proposes a genetic algorithm with non-random population initialization. Each attribute is initialized in the initial population with a probability based on its entropy (the higher the entropy, the less significant the attribute), and a survival probability factor is then applied to improve the population further. Because relevant attributes occur more frequently in the initial population, the GA gets a good start and finds better-fit rules in earlier generations, which reduces the time required.

I. INTRODUCTION

This work presents a system based on genetic algorithms (GAs) [1][2][3] to address the classification task. Classification [4][5] is the process of finding a model that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The use of GAs in classification is an attempt to effectively explore the large search space usually associated with classification tasks.
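The entropy-based bias summarized in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text so far: the "entropy of an attribute" is taken to be the class entropy that remains after partitioning the data on that attribute, the inclusion probability is taken as one minus that entropy normalized by the overall class entropy, and a small `floor` keeps every attribute reachable. The function names (`class_entropy`, `attribute_entropy`, `inclusion_probabilities`, `init_population`) and the `floor` parameter are hypothetical, and the survival probability factor is left out because it has not yet been defined.

```python
import math
import random
from collections import Counter, defaultdict


def class_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def attribute_entropy(rows, attr_index, labels):
    """Expected class entropy after partitioning the rows on one attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr_index]].append(label)
    total = len(labels)
    return sum(len(g) / total * class_entropy(g) for g in groups.values())


def inclusion_probabilities(rows, labels, floor=0.05):
    """Assumed mapping: lower entropy gives a higher inclusion probability."""
    base = class_entropy(labels)
    probs = []
    for i in range(len(rows[0])):
        h = attribute_entropy(rows, i, labels)
        # Normalize by the overall class entropy; guard against base == 0.
        score = 1.0 - (h / base if base > 0 else 1.0)
        probs.append(max(floor, score))
    return probs


def init_population(rows, labels, size, rng=None):
    """Build `size` candidate rules as (antecedent, class) pairs."""
    rng = rng or random.Random(0)
    probs = inclusion_probabilities(rows, labels)
    values = [sorted({row[i] for row in rows}) for i in range(len(rows[0]))]
    classes = sorted(set(labels))
    population = []
    for _ in range(size):
        # None marks an attribute that is absent from the rule antecedent.
        antecedent = [rng.choice(values[i]) if rng.random() < p else None
                      for i, p in enumerate(probs)]
        population.append((antecedent, rng.choice(classes)))
    return population
```

With this mapping, an attribute that separates the classes perfectly appears in every initial rule, while an uninformative attribute appears only at the `floor` rate. Other monotone mappings from entropy to probability would serve the same purpose.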
The motivation for applying GAs to data mining is that they are robust, adaptive search techniques that perform a global search in the solution space [6][7][8]. Rule mining using GAs [9][10][11] makes use of a fitness function and evolutionary operators to generate the most interesting, comprehensible and strong classification rules [12][13][14][15]. Classification rules use the most expressive and human-readable representation for hypotheses. They are IF-THEN rules, where the IF part states a condition over the data and the THEN part gives a class label.

Very often a GA starts its search with a randomly initialized population. Random initialization tends to be ineffective for rule mining because a randomly initialized population leads to slow convergence, and the algorithm can become a random walk in its initial runs if the randomly generated population does not produce good rules. Rules generated from a randomly initialized population also tend to be less accurate than rules generated from a population that uses domain knowledge [16][17][18].

This paper proposes a genetic algorithm with non-random population initialization for the discovery of classification rules. The initial population is biased towards more significant or informative attributes, so that the GA starts with good rules covering relatively more training instances. In data mining applications with large datasets, the approach is adopted to evolve better-fit rules in less time, thereby significantly enhancing the performance of the evolutionary rule mining process. The proposed approach is also expected to discover an overall better-fit and optimal rule set.

The rest of the paper is organized as follows. Section II describes the design of the GA in terms of initializing the population, evaluating the fitness of candidate rules and the genetic operators employed.
This section illustrates, with an example and explanation for the convenience of the reader, how the entropy of individual attributes and the survival probability are used to bias the initial population, the fitness measure used to evaluate the goodness of rules, and the genetic operators employed in rule mining. In Section III the proposed genetic algorithm is applied to 10 datasets publicly available from the UCI [19] and KEEL [20] data repositories, and the simulation results are compared with those of a GA with random initialization. Section IV concludes the paper and points to the future scope of this work.

II. THE PROPOSED GENETIC ALGORITHM DESIGN

The GA proposed in this work follows the Michigan approach to represent rules. We have used a GA with crowding to avoid convergence to a single best rule. Rules are discovered as production rules of the form 'If <Condition> Then <Conclusion>'. Classification rules are high-level symbolic rules and are considered comprehensible. The remaining details of the GA are given below.

International Journal of Artificial Intelligence and Knowledge Discovery, Vol. 4, Issue 3, July 2014. Print-ISSN: 2231-2021, e-ISSN: 2231-0312