Towards a Framework for the Empirical Analysis of Genetic Programming System Performance

Oliver Flasch and Thomas Bartz-Beielstein
Cologne University of Applied Sciences
{oliver.flasch, thomas.bartz-beielstein}@fh-koeln.de

Abstract. This chapter introduces the basics of a framework for statistically sound, reproducible empirical research in Genetic Programming (GP). It provides tools to understand GP algorithms and heuristics and their interaction with problems of varying difficulty. Following a rigorous approach in which scientific claims are broken down into testable statistical hypotheses and GP runs are treated as experiments, the framework helps to achieve statistically verified results of high reproducibility. A prototypical software implementation based on the R environment automates experiment setup, execution, and analysis. The framework is introduced by means of an example study comparing the performance of a reference GP system (TinyGP) with successively more complex variants of a more modern system (GMOGP), testing the intuition that complex problems require complex algorithms.

Keywords: Genetic Programming, Symbolic Regression, Design of Experiments, Sequential Parameter Optimization, Reproducible Research, Multi-Objective Optimization

1 Introduction

The goal of this chapter is to introduce a framework for the systematic empirical analysis of Genetic Programming (GP) system components and their influence on GP system performance. The main ideas and concepts are borrowed from the empirical approach to research in evolutionary computation described in [1]. Performing a thorough and statistically well-founded experimental analysis provides valuable insight into the behavior of GP components. In current GP research, much repeated work in experimental planning, setup, and result analysis is required when proposing improvements to GP system components such as selection and variation operators, individual representations, or general search heuristics.
To measure the performance benefit of an improved GP system component, a set of test functions has to be implemented, GP system parameters have to be chosen both for the system under study and for comparison systems, experiments have to be designed, and code for the statistical analysis of results has to be written. As obtaining statistically significant results often requires many independent runs, infrastructure for distributed execution is frequently necessary to make the implementation of an experiment plan practical.
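As a minimal illustration of the kind of statistical comparison such a framework automates, the following sketch compares best-fitness samples from many independent runs of two GP variants using a simple permutation test. Note that this is a hypothetical stand-in, not part of the framework described here: the data are simulated placeholders for actual GP run results, the system names merely echo the chapter's example study, and all function names are illustrative.

```python
import random
import statistics

def permutation_test(a, b, n_perm=10_000, rng=None):
    """Two-sided permutation test on the absolute difference of sample means.

    Returns the fraction of label permutations whose mean difference is at
    least as large as the observed one (an approximate p-value).
    """
    rng = rng or random.Random(1)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                    # randomly reassign run results
        perm_a = pooled[:len(a)]
        perm_b = pooled[len(a):]
        if abs(statistics.mean(perm_a) - statistics.mean(perm_b)) >= observed:
            count += 1
    return count / n_perm

# Simulated best-fitness values (lower is better) from 20 independent runs
# of each variant; in a real study these would come from actual GP runs.
rng = random.Random(42)
runs_tinygp = [rng.gauss(1.0, 0.2) for _ in range(20)]
runs_gmogp = [rng.gauss(0.8, 0.2) for _ in range(20)]

p = permutation_test(runs_tinygp, runs_gmogp)
print(f"approximate p-value: {p:.4f}")
```

The permutation test is chosen here only because it makes no distributional assumptions and fits in a few lines; a framework of the kind described in this chapter would additionally handle experiment design, run scheduling, and result archiving.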