Extracting Grammar from Programs: Evolutionary Approach

Matej Črepinšek¹, Marjan Mernik¹, Faizan Javed², Barrett R. Bryant², and Alan Sprague²

¹ University of Maribor, Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{matej.crepinsek, marjan.mernik}@uni-mb.si

² The University of Alabama at Birmingham, Department of Computer and Information Sciences, Birmingham, AL 35294-1170, U.S.A.
{javedf, bryant, sprague}@cis.uab.edu

Abstract. The paper discusses context-free grammar (CFG) inference using genetic programming, with application to inducing grammars from programs written in simple domain-specific languages. Grammar-specific heuristic operators and non-random construction of the initial population are proposed to achieve this task. The suitability of the approach is shown on small examples where the underlying CFGs are successfully inferred.

Keywords. Grammar induction, Grammar inference, Learning from positive and negative examples, Genetic programming

1 Introduction

In the accompanying paper [15] we discussed the search space of regular and context-free grammar inference. The conclusion reached was that, owing to the large search space, the exhaustive (brute-force) approach to grammar induction could only be applied to small positive samples. Hence, a need arose for a different and more efficient approach to exploring the search space. Evolutionary computation [16] is particularly suitable for such kinds of problems. In fact, genetic algorithms have already been applied to the grammar inference problem, with varying results. In this paper another evolutionary approach to CFG learning, Genetic Programming (GP), is presented. Genetic programming [3] is a successful technique for getting computers to automatically solve problems. It has been successfully used in a wide variety of application domains such as data mining, image classification and robotic control.
In general, genetic programming works well for problems whose solutions can be expressed as a modestly short program. For example, methods operating on typical data structures such as stacks, queues and lists have been successfully evolved using genetic programming in [4]. Specifications (BNF) of domain-specific languages are small enough that we can expect genetic programming to find a successful solution. Our previous work [6] succeeded in inferring small context-free grammars from positive and negative samples. This paper builds on that work and elaborates on our recent research findings.

2 Related Work

The impact of different grammar representations was explored in [14], where experimental results showed that an evolutionary algorithm using standard context-free grammars (BNF) outperforms those using Greibach Normal Form (GNF), Chomsky Normal Form (CNF) or bit-string representations [5]. This performance differential was attributed to the larger grammar search space of the other representations, which was a consequence of their more complex grammar form. The experimental assessment in [14] was very limited due to the long processing time (processing one generation took several hours; with our system, processing one generation takes just a few seconds). This was due to the use of the chart parser, which is commonly used in natural language parsing and can accept

ACM SIGPLAN Notices 39 Vol. 40(4), Apr 2005