Multiple-Target Reinforcement Learning with a Single Policy

Marc Peter Deisenroth  marc@cs.uw.edu
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA

Dieter Fox  fox@cs.uw.edu
Department of Computer Science & Engineering, University of Washington, Seattle, WA, USA

Abstract

We present a reinforcement learning approach to learning a single, non-hierarchical policy for multiple targets. In the context of a policy search method, we propose to define a parametrized policy as a function of both the state and the target. This allows for learning a single policy that can navigate the RL agent to different targets. Generalization to unseen targets is implicitly possible while avoiding the combination of local policies in a hierarchical RL setup. We present first promising experimental results that show the viability of our approach.

1. Introduction

Fully autonomous reinforcement learning (RL) often requires many trials to solve a task successfully (e.g., Q-learning), or learning requires a good initialization (e.g., by imitation (Abbeel & Ng, 2005)) and/or a deep understanding of the system. If this knowledge is not available, be it due to a lack of understanding of highly complicated dynamics or because a solution is simply not known, data-intensive learning methods are required. In a robotic system, however, many physical interactions are often infeasible and lead to worn-out robots. The more fragile a robotic system, the more important data-efficient learning methods are. Generally, model-based methods, i.e., methods that learn a dynamics model of the environment, are more promising for efficiently extracting valuable information from available data than model-free methods such as Q-learning or TD-learning.
One reason why model-based methods struggle when learning from scratch is that they suffer from model errors, i.e., they inherently assume that the learned dynamics model resembles the real environment sufficiently accurately, see, e.g., (Schneider, 1997; Schaal, 1997). Model errors are especially an issue when only a few samples are available or when only uninformative prior knowledge about the task to be learned is at hand. Hence, most model-based RL methods assume a pre-trained dynamics model, obtained through motor babbling, for instance (Ko et al., 2007), which is sample inefficient.

Pilco (probabilistic inference and learning for control) is a model-based policy search framework that sidesteps these issues by employing probabilistic dynamics models to account for model uncertainties (Deisenroth, 2010; Deisenroth & Rasmussen, 2011). The dynamics models are implemented as flexible non-parametric Gaussian processes (GPs) (Rasmussen & Williams, 2006). The key to its data efficiency is that pilco incorporates model uncertainties in a principled way by integrating them out during planning and decision making. This allows pilco to jointly learn good controllers and dynamics models from scratch using only a few interactions with the physical system. Using only general prior information, pilco achieves an unprecedented speed of learning tasks from scratch (Deisenroth & Rasmussen, 2011).

The controllers learned by pilco drive the system to a single desired target state; reasonable generalization to previously unseen targets is not possible. In this paper, we extend pilco to jointly deal with multiple targets during policy learning. During training, the learner has access to a small set of targets and learns a single controller jointly for all targets. We achieve generalization to unseen targets (in the same domain) by defining the policy as a function of both the state and the target.
At test time, this allows for generalization to unseen targets without retraining.

Our approach differs from hierarchical RL, where local policies are combined and often trained independently (Taylor & Stone, 2009). Cross-domain RL, proposed in (Taylor & Stone, 2007), aims at transferring a policy learned for one task to a related task in a different domain using rule transfer.
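The central idea, a single parametrized policy that receives the target as an additional input, can be sketched as follows. We use a small radial-basis-function controller over the concatenated (state, target) vector; the dimensions, basis count, and random parameters are hypothetical choices for illustration, not pilco's exact controller.

```python
# Sketch of a target-conditioned policy pi_theta(x, t): one parameter set
# produces different actions for different targets from the same state.
# The RBF controller and all dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

STATE_DIM, TARGET_DIM, N_BASIS = 2, 2, 10
centers = rng.uniform(-1, 1, size=(N_BASIS, STATE_DIM + TARGET_DIM))
weights = rng.standard_normal(N_BASIS)       # policy parameters theta

def policy(x, target, width=0.5):
    """Action = weighted sum of RBFs over the concatenated (state, target)."""
    z = np.concatenate([x, target])
    phi = np.exp(-np.sum((centers - z) ** 2, axis=1) / (2 * width ** 2))
    return float(weights @ phi)

x = np.array([0.1, -0.3])
u_a = policy(x, target=np.array([1.0, 1.0]))    # same state, target A
u_b = policy(x, target=np.array([-1.0, 0.0]))   # same state, target B
```

Because the target enters the policy as an ordinary input, training on a small set of targets shapes a single parameter vector that interpolates to nearby unseen targets, rather than requiring one local controller per target.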