Planning Using Online Evolutionary Overfitting

Spyridon Samothrakis and Simon M. Lucas, Senior Member, IEEE

Abstract— Biological systems tend to perform a range of tasks of extreme variability with extraordinary efficiency. It has been argued that a plausible scenario for achieving such versatility is explicitly learning a forward model. We perform a set of experiments using the original and a modified version of a classic reinforcement learning task, the mountain car problem, using a number of agents that encode both a direct and an abstracted version of a forward model. The results suggest that superior performance can be achieved if the forward model can be exploited in real time by an agent that has already internalised a model-free control function.

I. INTRODUCTION

The idea that the brain encodes a model of reality and uses that model to reason about the world has a long history in almost all fields of science that relate to action selection and animal/agent behaviour.

In classical Artificial Intelligence (A.I.), although rarely stated explicitly, a model of the world is used to infer future rewards (e.g. minimax, see [1]) via some tree-based search method. The power of classical A.I. to find near-optimal solutions given a model (and enough computational power) was recently demonstrated in a popular game competition [2], where the winning agents encoded a perfect model of reality (the Mario world in this case) and used A* [3] to reason with that model.

More recently, theories of brain function put forward by Friston et al. [4], [5] require that the brain encode a generative model of its sensory input as an indispensable part of its function. The generative model complements the brain's recognition model, and all action stems from the fact that an agent tries to reconcile its internal world model with reality, i.e. to minimise its prediction error. Another successful effort in sensory input modelling has been Distributed Adaptive Control (DAC) [6].
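The classical-A.I. pattern described above, i.e. searching through future states generated by a world model, can be illustrated with a minimal sketch. Everything here is a hypothetical toy (the `plan` routine, the `toy_model`, and the reward scheme are illustrative and not taken from any of the cited systems); the point is only that a forward model, a function from state and action to successor state and reward, is what makes tree-based lookahead possible:

```python
def plan(state, model, actions, depth):
    """Depth-limited search: pick the action whose simulated
    future (under the forward model) yields the best total reward."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state, reward = model(state, a)   # forward model: (s, a) -> (s', r)
        future, _ = plan(next_state, model, actions, depth - 1)
        value = reward + future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action

# Toy deterministic model on an integer line: unit reward for moving right.
toy_model = lambda s, a: (s + a, 1.0 if a > 0 else 0.0)
value, action = plan(0, toy_model, actions=[-1, 1], depth=3)
```

With a perfect model (as in the Mario competition example), the quality of such a planner is limited only by search depth and computational budget.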
DAC was partially created to show how behaviour can emerge from the interaction of different cortical areas. It defines a generic multi-component agent architecture, with one of the components responsible for predicting future sensory inputs. In a hybrid wall-avoidance/phototaxis task, it was shown that agents that explicitly try to predict their future manage to minimise the entropy of their sensory input, i.e. they have more stable trajectories.

A more abstract line of thinking concerning the idea of encoding forward models comes from cybernetics [7] and neural networks [8]. The ideas in these papers predate DAC and the free-energy formulation, and expose the theoretical benefits of a forward model.

Spyridon Samothrakis and Simon M. Lucas are with the School of Computer Science and Electronic Engineering, University of Essex, Colchester CO4 3SQ, United Kingdom (emails: ssamot@essex.ac.uk, sml@essex.ac.uk).

From the field of reinforcement learning, a number of model-based systems have been proposed, most notably Dyna-Q [9] and Dyna-Q2 [10]. The approach followed in these papers is concerned more with using the model for planning than with correcting possibly corrupted input.

In evolutionary biology, it has been argued that a possible reason behind the development of the mind was the ability of its carrier to perform mental simulations of possible future events, thus allowing the agent to "test" its theories about the future [11] without physical harm. From adaptive behaviour, a number of publications (e.g. [12], [13]) exploit the model as part of a setup called "anticipatory" behaviour, which is strongly focused on using internal planning to guide the actions of an agent.

Finally, from a neuroscientific perspective, all the above uses of a forward model¹ are discussed in [15]. For that paper's main experiment, a number of participants were asked to move their hands in total darkness and assess the new position of each hand.
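Dyna-Q, mentioned above, interleaves direct reinforcement learning with planning updates replayed from a learned model. The following is a minimal tabular sketch under simplifying assumptions (deterministic model, illustrative hyperparameters); it is not the implementation from [9], only the shape of the algorithm:

```python
import random

def dyna_q_update(Q, model, s, a, r, s_next, actions,
                  alpha=0.1, gamma=0.95, planning_steps=5):
    """One Dyna-Q step: a direct Q-learning update from real experience,
    followed by `planning_steps` simulated updates drawn from the
    learned (deterministic, tabular) model."""
    # Direct reinforcement learning from the real transition.
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
    # Model learning: remember what the world did for this state-action pair.
    model[(s, a)] = (r, s_next)
    # Planning: replay randomly chosen remembered transitions.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        pbest = max(Q.get((ps_next, b), 0.0) for b in actions)
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * pbest - Q.get((ps, pa), 0.0))
```

The planning loop is where the model earns its keep: each real transition can be reused many times, so value estimates converge with far less real experience.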
The experimental data collected can be accounted for by a Kalman filter [16]. This supports the view that a forward model is indeed encoded by the central nervous system, as a Kalman filter requires such a model.

In this paper we revisit the subject of sensory/state input modelling in a somewhat ad-hoc manner, using methods from both artificial life and classical machine learning to create an agent that reasons using a model of its environment. We claim three contributions. First, we show that a perfect forward model can be used to boost the performance of an agent after reactive/adaptive learning, achieving the best published results to date (as far as we know) in a classic reinforcement learning task, the mountain car problem (see the results section). Second, we show that an agent that has internalised a model of the environment can achieve better performance than an agent that simply encodes a policy, even if the model is imperfect, as long as it tries to reconcile the error between the internal model and the real world. Finally, we treat reactive control as merely a strong prior over reflective control, which may allow one to gracefully increase the quality of an agent in an online fashion.

The rest of the paper is organised as follows: a background section introduces the world the agent is embodied in, alongside the technical components that make up each agent; a methodology section describes how these components are linked to create agents; a results section presents our findings; and finally a

¹ The term "forward model" is overloaded; see [14] for more details. In the particular experiment performed in this paper, sensory and state inputs are the same, so we can avoid further clarification.
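As a concrete example of what a "perfect forward model" means for the task studied here, the standard mountain car dynamics can be written down exactly. The sketch below uses the common Sutton-Barto formulation with typical bounds; the exact constants, bounds and goal condition of the variant used in this paper's experiments may differ:

```python
import math

def mountain_car_model(position, velocity, action):
    """Exact forward model for the standard mountain car task.
    action is -1 (reverse), 0 (coast) or +1 (full throttle forward)."""
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))       # typical velocity bounds
    position += velocity
    position = max(-1.2, min(0.6, position))         # typical position bounds
    if position == -1.2:                             # inelastic hit on the left wall
        velocity = 0.0
    return position, velocity

# An agent holding this model can simulate futures without acting in the world:
p, v = -0.5, 0.0
for _ in range(50):
    p, v = mountain_car_model(p, v, +1)   # always push forward
```

Because the car is underpowered, naively pushing forward from the valley floor does not reach the goal; it is exactly this property that makes lookahead with the forward model valuable.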