Hyper-heuristic Decision Tree Induction

Alan Vella, David Corne
School of MACS, Heriot-Watt University, Edinburgh, UK
mail@alanvella.com, dwcorne@gmail.com

Chris Murphy
Motorola Ltd, Swindon, UK
Chris.Murphy@motorola.com

Abstract—Hyper-heuristics are increasingly used in function and combinatorial optimization. Rather than attempt to solve a problem using a fixed heuristic, a hyper-heuristic approach attempts to find a combination of heuristics that solves the problem (and which may, in turn, be directly applicable to a whole class of problem instances). Hyper-heuristics have been little explored in data mining. Here we apply a hyper-heuristic approach to data mining by searching a space of decision tree induction algorithms. The result of hyper-heuristic search in this case is a new decision tree induction algorithm. We show that hyper-heuristic search over a space of decision tree induction rules is able to find decision tree induction algorithms that outperform many different versions of ID3 on unseen test sets.

Keywords—data mining, hyper-heuristics, decision trees, evolutionary algorithm.

I. INTRODUCTION

Hyper-heuristics [1] are increasingly used in function and combinatorial optimization. The essential idea of hyper-heuristics is to search for an algorithm rather than for a specific solution to a given problem. From the viewpoint of evolutionary computation, a hyper-heuristic can simply be regarded as a sophisticated encoding: the genotype represents an algorithm, and when we interpret it, by running the algorithm on the given problem data, the result is a solution to a given problem instance. Hence we obtain a candidate solution via a “genotype->algorithm->candidate-solution” route, rather than via a direct “genotype->candidate-solution” mapping. The interesting aspect of hyper-heuristics is the potential re-use of the algorithms that emerge from the search process. With appropriate experimental design (e.g., by using many problem instances in the initial hyper-heuristic training), new, effective and fast algorithms may be discovered that apply to a wide class of problem instances.

The origin of this notion can be traced to Fisher and Thompson’s work [2], which investigated combinations of basic rules for job-shop scheduling. Other work pursued similar ideas during the 1990s, essentially re-discovering or extending [2]; most of this continued to be in the area of job-shop scheduling. For example, Fang et al. [3] used evolutionary algorithms to evolve sequences of heuristics for job-shop and open-shop problems, while Zhang and Dietterich [4, 5] developed novel job-shop scheduling heuristics within a reinforcement learning framework. Another notable study was that of Gratch et al. [6], which used hill-climbing in a space of control strategies to find good algorithms for controlling satellite communication schedules.

Hyper-heuristics have now been applied to a variety of problems; however, these are almost exclusively in the area of combinatorial optimization, and therein the majority involve scheduling. A few examples outside scheduling include bin packing [7] and cutting stock [8]. In bin packing, for example, novel constructive algorithms were developed that outperformed standard bin-packing constructive heuristics over a wide range of unseen test instances.

Very little work has so far explored the use of hyper-heuristics in data mining. Here, the task is invariably to find a classifier (which might be a decision tree, a set of rules, a neural network, etc.) that performs well in classifying test data. In other words, this is a search in classifier space for a good classifier. To align this with the possibility of using hyper-heuristics, we can instead consider it as a search through the space of methods that build classifiers from training data. To date, one group has started to explore this idea: in Pappa and Freitas [9], grammar-based genetic programming is used to evolve rule induction algorithms; the original idea was presented in [10]. A broad category of rule induction algorithms operates via “sequential covering”: an initial rule is generated, covering some of the dataset, and additional rules are generated in turn until the entire dataset is covered. There are several alternative ways to generate the initial and subsequent rules. For example, we may start with a very general, high-coverage (but low-accuracy) rule and add conditions until accuracy and/or coverage move beyond a threshold; or we may start with a very precise rule and gradually remove conditions. In [9], the encoding covers a vast space of possible ways to organize this process.
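To make the sequential covering scheme concrete, the following is a minimal sketch of its general-to-specific variant. The rule representation (a mapping from attributes to required values), the fixed target class, and the accuracy threshold are illustrative assumptions made for this sketch; they are not the encoding of [9].

```python
# A minimal sketch of general-to-specific sequential covering.
# The rule representation (attribute -> required value), the fixed
# target class, and the 0.9 accuracy threshold are illustrative
# assumptions, not the encoding used in [9].

def covers(rule, example):
    # An empty rule covers every example.
    return all(example[a] == v for a, v in rule.items())

def accuracy(rule, examples, target):
    covered = [e for e in examples if covers(rule, e)]
    if not covered:
        return 0.0
    return sum(e["class"] == target for e in covered) / len(covered)

def grow_rule(examples, attributes, target, threshold=0.9):
    # Start with the most general (empty) rule; greedily add the
    # condition that most improves accuracy until the threshold is met.
    rule = {}
    best_acc = accuracy(rule, examples, target)
    while best_acc < threshold:
        best = None
        for attr in (a for a in attributes if a not in rule):
            for val in {e[attr] for e in examples}:
                cand = dict(rule, **{attr: val})
                acc = accuracy(cand, examples, target)
                if acc > best_acc:
                    best, best_acc = cand, acc
        if best is None:  # no condition improves the rule; stop early
            break
        rule = best
    return rule

def sequential_covering(examples, attributes, target):
    # Generate rules one at a time, removing covered examples,
    # until the dataset is covered or no further progress is made.
    rules, remaining = [], list(examples)
    while remaining:
        rule = grow_rule(remaining, attributes, target)
        covered = [e for e in remaining if covers(rule, e)]
        if not covered:
            break
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules
```

In [9], design decisions that this sketch fixes for readability, such as how rules are grown and evaluated and when to stop, are themselves left open to the search.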
Here we explore a hyper-heuristic approach to decision tree induction, by searching a space of decision tree induction algorithms. We use a simpler encoding than [9], essentially restricting the algorithm space to a single overall control structure; however, we make more heuristic ‘components’ available, and hence explore a wider range of variants of a specialized class of algorithms. Simply put, the classic decision tree induction algorithm builds a tree step by step by deciding, at each step, how to develop the next node in the tree. This comes down to choosing a specific attribute in the dataset. For example, if the chosen attribute is “gender”, then the current node will have two children, one for the case “gender = male” and another for “gender = female”. The choice of attribute is made using a heuristic, which tests how well each candidate attribute discriminates between values of the target class (a sketch of this step is given below). Our method is to search a space of rulesets, where individual
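As an illustration of the attribute-selection step described above, here is a minimal sketch of the information-gain heuristic used by ID3. The dataset representation (a list of dicts with a "class" key) is an assumption made for the example.

```python
import math
from collections import Counter

# A minimal sketch of ID3's attribute-selection heuristic.
# The dataset representation (a list of dicts with a "class" key)
# is an assumption made for this example.

def entropy(examples):
    # Shannon entropy of the class distribution.
    counts = Counter(e["class"] for e in examples)
    n = len(examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute):
    # Entropy reduction achieved by splitting on the attribute:
    # one child node per observed attribute value.
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(examples) - remainder

def choose_attribute(examples, attributes):
    # ID3 picks the attribute whose split most reduces class entropy.
    return max(attributes, key=lambda a: information_gain(examples, a))

# For the "gender" example in the text, a node split on "gender" gets
# one child per observed value:
# children = {v: [e for e in examples if e["gender"] == v]
#             for v in {e["gender"] for e in examples}}
```

In the hyper-heuristic setting explored here, it is this choice of splitting heuristic, fixed to information gain in ID3 and in the sketch above, that becomes the object of search.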