International Journal of Computer Applications (0975-8887)
Volume 60, No. 1, December 2012

Knowledge Acquisition Tool for Classification Rules using Genetic Algorithm Approach

Fadl Mutaher Ba-Alwi
Faculty of Computer and IT
Sana’a University – Yemen

ABSTRACT
Classification Rule Mining (CRM) is a data mining technique for discovering important classification rules from a large dataset. This work presents an efficient genetic algorithm for discovering significant IF-THEN rules from a given dataset. The proposed algorithm consists of two main steps. The first step generates a set of classification rules, and the second step deletes the weak rules and selects only the significant rules. Since weak rules are deleted and significant rules are selected, the proposed algorithm can be considered a knowledge acquisition tool for classification problems. Experimental results are presented to demonstrate the contribution of the proposed algorithm to discovering the significant rules.

General Terms
Knowledge Discovery in Databases (KDD), Data Mining, Genetic Algorithm, Machine Learning, Pattern Recognition.

Keywords
Classification rules; Genetic algorithm; Significant rule.

1. INTRODUCTION
Data mining is the rapidly evolving art and science of discovering and exploiting new, useful, and profitable relationships in data, and it is attracting great interest in areas such as decision making, performance prediction, and many other applications [1]. Classification Rule Mining (CRM) is a data mining technique for discovering important classification rules from a large dataset that is coupled with a set of predefined classes [2].

In the classification task, the discovered knowledge can be represented in different forms. The most intuitive form for most users is the IF-THEN prediction rule. The IF-part (called the rule antecedent) contains a conjunction of m conditions on values of predictor attributes.
The THEN-part (called the rule consequent) contains a prediction about the value of a class attribute. Several approaches to classification rule mining have been proposed in the machine learning literature [3-10].

Genetic Algorithm (GA) is a search technique that has been heavily used in areas where the search space is large. GA is based on the mechanics of natural selection and inspired by the principle of survival of the fittest, where the fittest individuals are selected to produce offspring for the next generation [11]. Selection, crossover, and mutation are the basic genetic operators for generating offspring from the fittest individuals. Several GA approaches have been designed for discovering classification rules [11-19].

The contribution of this paper is the discovery of significant (novel) rules from a large dataset using a genetic algorithm approach. The proposed algorithm consists of two main steps. The first step generates a set of classification rules, and the second step removes the weak rules and selects only the significant rules. Since weak rules are deleted and significant rules are selected, the proposed algorithm can be considered a knowledge acquisition tool for classification problems.

This work is organized as follows. Section 2 describes the related work on significant rules. Section 3 gives a detailed description of the proposed method. Section 4 describes how the significant rules can be selected from a set of classification rules. Section 5 presents the computational results for the dataset used in the experiment and a comparative study with existing techniques. Finally, Section 6 concludes the paper.

2. RELATED WORK
The major drawback of CRM is the large number of rules that may be generated [2]. Researchers use different measures to select only the important rules from all possible rules. In [4] the authors extract unexpected classification rules.
In [5] the authors considered statistical quantitative rules (SQ rules) as a new category of rules and proposed a permutation-based algorithm for discovering significant SQ rules. The problem addressed in [6] is how to efficiently select a limited number (k) of significant Association Rules (ARs) from the full set of classification ARs. The algorithm proposed in [7] applies a statistical significance test before accepting a pattern.

Numerous attempts have been made to apply GAs in data mining for knowledge discovery and classification. The CRM model with GA in [11] considers the characteristics of cloud computing. In [14] a classification algorithm based on a GA approach is presented to discover production rules in Conjunctive Normal Form (CNF), where a conjunctive relationship exists between two attributes and a disjunction holds among the values of the same attribute. In [17, 20] the proposed GA for IF-THEN classification rules tries to avoid the drawbacks of a randomly created initial population by creating the initial population in a systematic way using the generalized Uniform Population (UP) method.

3. THE PROPOSED GA APPROACH
According to [13], the classification rules discovered by GA are more accurate than the rules obtained by other classification algorithms. GA works similarly to what happens in nature, where species evolve by natural selection. An initial population of individuals is generated randomly or in a systematic way. The individuals in the current population are encoded and evaluated according to a fitness function. Individuals are then selected according to their fitness to form a new population.

3.1 Individual Encoding
GA has two styles of rule encoding, namely Michigan and Pittsburgh. In the Michigan style each individual encodes a single rule, whereas in the Pittsburgh style each individual encodes a set of rules. The Michigan style is preferable when the goal is to find a small set of accurate classification rules [3].
Therefore, the Michigan style is adopted for encoding in this paper.
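
As a rough illustration of Michigan-style encoding, the following Python sketch represents one individual as a single IF-THEN rule whose antecedent is a conjunction of attribute-value conditions and whose consequent is a class value. The names Rule, covers, and count_matches, as well as the weather-style attributes in the example, are assumptions made only for this illustration; the paper does not prescribe a concrete data structure at this point.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A record maps attribute names to values. The attribute names and values
# used below are hypothetical and chosen only for illustration.
Record = Dict[str, str]


@dataclass(frozen=True)
class Rule:
    """One Michigan-style individual: a single IF-THEN rule (assumed layout)."""
    antecedent: Tuple[Tuple[str, str], ...]  # conjunction of (attribute, value) conditions
    consequent: str                          # predicted value of the class attribute

    def covers(self, record: Record) -> bool:
        # The IF-part holds only when every condition is satisfied.
        return all(record.get(attr) == value for attr, value in self.antecedent)

    def __str__(self) -> str:
        conds = " AND ".join(f"{a} = {v}" for a, v in self.antecedent)
        return f"IF {conds} THEN class = {self.consequent}"


def count_matches(rule: Rule, records: List[Record], class_attr: str) -> Tuple[int, int]:
    """Return (records covered by the IF-part, covered records with the predicted class)."""
    covered = [r for r in records if rule.covers(r)]
    correct = sum(1 for r in covered if r[class_attr] == rule.consequent)
    return len(covered), correct


if __name__ == "__main__":
    data = [
        {"outlook": "sunny", "windy": "no", "play": "yes"},
        {"outlook": "sunny", "windy": "yes", "play": "no"},
        {"outlook": "rainy", "windy": "no", "play": "yes"},
    ]
    rule = Rule(antecedent=(("outlook", "sunny"), ("windy", "no")), consequent="yes")
    print(rule)
    print(count_matches(rule, data, class_attr="play"))
```

Under this layout a population is simply a list of Rule objects, one rule per individual. A Pittsburgh-style individual would instead hold a collection of such rules.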