Linear Genetic Programming using a compressed genotype representation

Johan Parent
Vrije Universiteit Brussel, IR, ETRO
Pleinlaan 2, 1050 Brussel, BELGIUM
johan@info.vub.ac.Be

Ann Nowé
Vrije Universiteit Brussel, Faculty of Science, COMO
Pleinlaan 2, 1050 Brussel, BELGIUM
asnowe@info.vub.ac.Be

Kris Steenhaut
Vrije Universiteit Brussel, IR, ETRO
Pleinlaan 2, 1050 Brussel, BELGIUM
kris@info.vub.ac.Be

Anne Defaweux
Vrije Universiteit Brussel, Faculty of Science, COMO
Pleinlaan 2, 1050 Brussel, BELGIUM
adefaweu@vub.ac.Be

Abstract- This paper presents a modularization strategy for linear genetic programming (GP) based on a substring compression/substitution scheme. The purpose of this substitution scheme is to protect building blocks; in other words, it is a form of linkage learning. The compression of the genotype provides both a protection mechanism and a form of genetic code reuse. This paper presents results for synthetic genetic algorithm (GA) reference problems like SEQ and OneMax as well as several standard GP problems. These include a real world application of GP to data compression. Results show that, although the compression of substrings assumes a tight linkage between alleles, this approach improves the search process.

1 Introduction

Genetic algorithms (GA) and genetic programming (GP) search the space of possible solutions by manipulating the solution representation using genetic operators like crossover and mutation. Variable length representations allow the structure of the solution to be evolved but are still limited by the genetic primitives used to build the solutions. If GA and GP are to address more complex problems, it becomes necessary to adapt the representation to the problem. Modularization adapts the representation by extending the set of genetic primitives with problem-specific functionality. In this text we explore a low-level modularization technique.
The presented compressed GA (cGA) attempts to identify useful combinations of genes in the population and makes these available as new primitives. The search process can then benefit from these new functionalities to progress more quickly through the search space. The cGA achieves modularization by applying a compression operator to the population of candidate solutions.

This paper is structured as follows. Section 2 describes related work. In Section 3 the compression/substitution scheme is presented in detail. In Section 4 the different experiments are presented, each with different characteristics that affect the performance of our algorithm. Results are presented in Section 5.

2 Related work

The benefits of modularization for the GP search process are well known [10][1][7]. Modularization fosters code reuse on the one hand, and on the other hand allows the GP system to identify and use high-level functionalities. Different modularization strategies for tree-based GP have been explored. Automatically Defined Functions (ADF) is the most prominent approach. It consists of a main tree which evolves together with a predefined number of additional trees. Those additional trees can be called from the main tree and, as such, complement the set of primitives available to the GP system. An alternative method is encapsulation. It replaces a subtree by a new symbol and adds it to the primitive set of the system [6][10]. The symbol created in this way corresponds to a terminal/leaf node since it has no arguments. A third method, module acquisition, works in a similar fashion, but can create both function nodes (modules) and leaf nodes. If the depth of the selected subtree exceeds a certain threshold, the subtrees below this level are removed. In this case a function node is created by adding an argument for each removed subtree [2][3].
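The depth-based cut used by module acquisition can be illustrated with a minimal sketch. It is not taken from [2][3]; the nested-tuple tree encoding, the function name acquire_module, the placeholder names arg0, arg1, ... and the convention that the root sits at depth 0 are all assumptions made for this illustration. When no subtree is removed, the result has no arguments and so corresponds to the leaf node produced by encapsulation.

```python
from typing import List, Tuple, Union

# Assumed encoding: a terminal is a str, a function node is a tuple
# (function_name, child, child, ...).
Tree = Union[str, tuple]


def acquire_module(subtree: Tree, max_depth: int) -> Tuple[Tree, List[Tree]]:
    """Cut `subtree` at `max_depth` (root is depth 0, max_depth >= 1).

    Returns the module body and the list of removed subtrees. Each removed
    subtree is replaced in the body by a placeholder 'argN', so the module
    takes one argument per removed subtree.
    """
    removed: List[Tree] = []

    def cut(node: Tree, depth: int) -> Tree:
        if isinstance(node, str):
            # Terminals are always kept, whatever their depth.
            return node
        if depth == max_depth:
            # Function node at the threshold: remove it, leave an argument.
            removed.append(node)
            return f"arg{len(removed) - 1}"
        return (node[0],) + tuple(cut(child, depth + 1) for child in node[1:])

    body = cut(subtree, 0)
    return body, removed


# Usage: the subtree (y - 1) lies at the threshold and becomes an argument.
body, removed = acquire_module(("+", ("*", "x", ("-", "y", "1")), "x"), 2)
print(body)     # ('+', ('*', 'x', 'arg0'), 'x')
print(removed)  # [('-', 'y', '1')]
```

A shallow subtree such as ("+", "x", "y") is returned unchanged with an empty argument list, which is the encapsulation case.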
Although modularization has mainly been of interest to the GP community, it is of course related to the search for building blocks in a GA. In [5] a module acquisition algorithm is presented which exploits modularity and hierarchy. Despite the use of a variable length representation, the problem of modularity is approached purely from a GA point of view. A module is defined as a set of gene values that correspond to a maximum fitness for the individual. As in the work of [5], our algorithm relies on a substitution scheme, but with notable differences in the substitution strategy. Another difference is that the compressed genotype scheme of the cGA was developed with its use in a linear GP system in mind. The linear encoding of a GP program exhibits a much tighter linkage than is the case for standard GA problems. This assumption is important as it justifies the use of the simple compression scheme, which supposes a strong dependency between adjacent values.

3 A compressed GA

This section presents the compressed genetic algorithm (cGA), which consists of a GA extended with a substitution/compression based modularization mechanism. Figure 1 contains the pseudocode for the cGA loop. The cGA adds a compression and a decompression step to the GA loop. Compression is applied to the individuals of the population prior to the creation of the next generation. As a result the genetic operators, 1-point crossover and 1-point mutation, are applied to compressed genotype representations. After the creation of the next generation the individuals are decompressed so that their fitness can be evaluated. Compression is applied to the genotype to protect