Learning to Drive a Real Car in 20 Minutes

Martin Riedmiller
Neuroinformatics Group, Univ. of Osnabrueck
Email: martin.riedmiller@uos.de

Mike Montemerlo, Hendrik Dahlkamp
AI Lab, Stanford University
Email: {montemerlo, dahlkamp}@stanford.edu

Abstract

This paper describes our first experiments with Reinforcement Learning to steer a real robot car. The applied method, Neural Fitted Q Iteration (NFQ), is purely data-driven, based on data collected directly from real-life experiments, i.e. no transition model and no simulation is used. The RL approach is based on learning a neural Q value function, which means that no prior selection of the structure of the control law is required. We demonstrate that the controller is able to learn a steering task in less than 20 minutes directly on the real car. We consider this an important step towards the competitive application of neural Q-function based RL methods in real-life environments.

1 Introduction

The interest in applying Reinforcement Learning (RL) methods to real-life control applications is growing rapidly, e.g. [7], [14], [9], [5]. In this paper we focus on situations where the controller learns by interacting with the real system only. In particular, for the design of the controller we do not assume that a system model is available, neither in the form of system equations nor in the form of a simulator (the latter approach has been successfully applied in a number of applications, see e.g. [4], [7]). In contrast, here we only assume that the controller is able to collect state-action transitions by observing the behaviour of the real system while controlling it. Learning by interacting with the real system directly has an important advantage: the controller is tailored exactly to the behaviour of the real system at hand, instead of to a more or less exact model of it.
The big challenge in learning with real systems lies in the fact that learning must occur in a reasonable amount of time and with reasonable effort: in a real application, one typically cannot wait for hundreds of thousands of episodes until a controller is learned. Another important aspect of this paper is that we do not need prior knowledge about the policy to be learned. This has the advantage that we do not constrain the control law a priori to a certain class of controllers. Additionally, the proposed approach is applicable even in situations where a priori no idea of a working control law is available.

[Figure 1. The car used is a VW Passat, equipped with additional sensors.]

In principle, value-function based methods offer the advantage of a very flexible representation of the control policy to be learned. However, in their original on-line learning form, this advantage comes at the cost of very long training times, which makes them unrealistic for real-life applications. Recently, memory-based RL approaches have been proposed that make approximate model-free value iteration algorithms much more efficient by explicitly memorizing and reusing transition information. One of them is Neural Fitted Q Iteration (NFQ) [11]. NFQ stores all transition tuples (state, action, successor state) seen so far and reuses them in every update step of the Q-function. This compensates for an essential problem of 'non-local' function approximators: output values at points in input space that are not currently being updated can deteriorate or be 'forgotten'. Combining this explicit memorization of data points with the otherwise good generalisation abilities of multi-layer perceptrons leads to a model-free, Q-value function based RL approach that is highly efficient with respect to the amount of training data needed.
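The core of such a batch approach is the fitted Q iteration loop: from the stored transitions (s, a, r, s'), build a supervised training set with targets r + gamma * max_a' Q_k(s', a'), then fit the Q-function to it and repeat. The sketch below illustrates this loop on a hypothetical toy chain task; the MDP, the reward values, and the exact table-based regressor (standing in for the paper's multilayer perceptron) are illustrative assumptions, not the authors' setup.

```python
# Deterministic toy chain MDP: states 0..4, actions -1/+1; state 4 is the goal
# and terminal. This is a stand-in example, not the steering task from the paper.
N_STATES, ACTIONS, GAMMA = 5, (-1, +1), 0.9

def step(s, a):
    """Environment step: clip to the chain, reward 1 on reaching the goal."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return r, s2

# 1) Collect transition tuples (state, action, reward, successor state) once,
#    as if by interacting with the real system.
D = [(s, a, *step(s, a)) for s in range(N_STATES - 1) for a in ACTIONS]

# 2) Fitted Q iteration: repeatedly build a supervised training set of
#    (state, action) -> target pairs from ALL stored transitions and fit
#    the Q-function to it. NFQ fits a multilayer perceptron here; this
#    sketch uses an exact tabular "regressor" to stay dependency-free.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for _ in range(50):
    targets = {}
    for (s, a, r, s2) in D:
        terminal = (s2 == N_STATES - 1)
        targets[(s, a)] = r + (0.0 if terminal
                               else GAMMA * max(Q[(s2, b)] for b in ACTIONS))
    Q.update(targets)  # "fitting" the tabular regressor is a direct copy

# Greedy policy from the learned Q-function: move right everywhere.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)
```

Because every update sweep reuses the whole transition set, no experience is forgotten between iterations; with a neural regressor in place of the table, this is exactly the mechanism that protects the 'non-local' approximator described above.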
NFQ can be seen as an instantiation of the family of Fitted Q Iteration algorithms [1], which themselves are a special kind of Fitted Value Iteration algorithms [3], [8].

FBIT 2007