Robotics and Autonomous Systems 59 (2011) 910–922

Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives

Minija Tamosiunaite a,b,*, Bojan Nemec c, Aleš Ude c, Florentin Wörgötter a

a University Göttingen, Institute for Physics 3 - Biophysics, Bernstein Center for Computational Neuroscience, Friedrich-Hund-Platz 1, 37077 Göttingen, Germany
b Vytautas Magnus University, Department of Informatics, Vileikos 8, 44404 Kaunas, Lithuania
c Jožef Stefan Institute, Department of Automatics, Biocybernetics, and Robotics, Jamova 39, 1000 Ljubljana, Slovenia

Article history: Received 1 July 2010; Received in revised form 30 June 2011; Accepted 4 July 2011; Available online 12 July 2011

Keywords: Reinforcement learning; PI²-method; Natural actor critic; Value function approximation; Dynamic movement primitives

Abstract

When describing robot motion with dynamic movement primitives (DMPs), goal (trajectory endpoint), shape, and temporal scaling parameters are used. In reinforcement learning with DMPs, the goal and temporal scaling parameters are usually predefined and only the weights shaping the DMP are learned. Many tasks exist, however, where the best goal position is not known a priori and must therefore be learned. Thus, here we specifically address the question of how to combine goal and shape parameter learning simultaneously. This is a difficult problem because learning of the two sets of parameters could easily interfere in a destructive way. We apply value function approximation techniques for goal learning and direct policy search methods for shape learning. Specifically, we use ‘‘policy improvement with path integrals’’ and ‘‘natural actor critic’’ for the policy search. We solve a learning-to-pour-liquid task in simulations as well as on a Pa10 robot arm.
Results for learning from scratch, learning initialized by human demonstration, and for modifying the tool for the learned DMPs are presented. We observe that the combination of goal and shape learning is stable and robust within large parameter regimes. Learning converges quickly even in the presence of disturbances, which makes the combined method suitable for robotic applications.

© 2011 Elsevier B.V. All rights reserved.

* Corresponding author at: University Göttingen, Institute for Physics 3 - Biophysics, Bernstein Center for Computational Neuroscience, Friedrich-Hund-Platz 1, 37077 Göttingen, Germany. Tel.: +49 370 687 37788; fax: +49 0 551 397720. E-mail addresses: m.tamosiunaite@if.vdu.lt (M. Tamosiunaite), bojan.nemec@ijs.si (B. Nemec), ales.ude@ijs.si (A. Ude), worgott@physik3.gwdg.de (F. Wörgötter).

1. Introduction

Dynamic movement primitives (DMPs), proposed by Ijspeert et al. [1], have become one of the most widely used tools for the generation of robot movements, and numerous applications can be found in the literature [2–5]. The DMP formalism is employed to describe goal-directed movements and includes second-order dynamics toward an attractor point, called the goal point g of the movement, as well as several adjustable parameters used to obtain the desired shape of the trajectory. In this study, we consider the question of robot reinforcement learning with dynamic movement primitives. Several efficient methods have been proposed for DMP shape parameter learning, including the natural actor critic (NAC, [3]), policy improvement with path integrals (PI², [5]), and policy learning by weighting explorations with the returns (PoWER, [4]). Using these methods, robots have been trained to acquire specific skills, for example jumping across a gap by a robot dog [5], hitting a baseball with a robot arm [3], or playing the ball-in-a-cup game with a humanoid robot [4].
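To make the formalism sketched above concrete, the following is a minimal one-dimensional DMP integration in the style of Ijspeert et al. [1]: a critically damped second-order system pulls the state toward the goal g, while a nonlinear forcing term built from Gaussian basis functions (whose weights w are the shape parameters) modulates the trajectory. The gains, basis placement, and integration scheme here are illustrative choices, not the parameters used in this paper.

```python
import numpy as np

def dmp_rollout(g, w, y0=0.0, tau=1.0, dt=0.002, n_steps=500,
                alpha_z=25.0, beta_z=25.0 / 4.0, alpha_x=8.0):
    """Euler integration of a 1-D DMP; returns the position trace y(t).

    w: weights of the Gaussian basis functions (the shape parameters).
    g: attractor point of the second-order dynamics (the goal).
    """
    n_basis = len(w)
    # Basis centres spread along the decaying phase variable (illustrative).
    c = np.exp(-alpha_x * np.linspace(0.0, 1.0, n_basis))
    h = 1.0 / np.diff(c, append=c[-1] * 0.5) ** 2  # widths from spacing

    y, z, x = y0, 0.0, 1.0
    trace = []
    for _ in range(n_steps):
        psi = np.exp(-h * (x - c) ** 2)
        f = (psi @ w) * x / (psi.sum() + 1e-10)   # nonlinear forcing term
        z += dt / tau * (alpha_z * (beta_z * (g - y) - z) + f)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)            # canonical system (phase)
        trace.append(y)
    return np.array(trace)
```

With zero weights the forcing term vanishes and the rollout reduces to a critically damped approach to g; nonzero weights reshape the transient without changing the endpoint, which is what makes shape and goal parameters separable in the first place.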
Here we consider the combination of DMP goal and shape learning. DMP goal learning has so far received little attention in robot experiments [6,7], and the simultaneous combination of the two learning regimes is novel. The reason is that in most tasks considered so far the goal position is known well enough, so goal learning is not required. There are, however, many tasks where this is not the case, namely whenever the goal has a hard-to-predict effect on the outcome. One example, which is also at the core of the current study, is the pouring of a liquid. The complex turbulent motion of the liquid at the rim of the container makes it very hard to predict at which position (= goal) the container should be placed for best pouring results. The same is true for other dynamic tasks, like throwing objects to hit a predefined target [7], or placing one object on top of another where the stable configuration of the two objects is not known in advance. Similar problems arise when working with tools, e.g. hammering, where the arm is stopped at some specific position, letting the weight of the hammer do the final hitting. Goal and shape are also important when working with soft materials: when putting a table cloth on a table, for example, the goal position as well as the shape of the swinging movement that unfolds the cloth both matter.

doi:10.1016/j.robot.2011.07.004
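To illustrate why goal learning becomes a value-estimation problem when the goal's effect on the outcome is hard to predict, here is a small, entirely hypothetical sketch (not the paper's algorithm, reward, or parameters): the value of each candidate goal position is estimated as a running average of noisy rewards, with epsilon-greedy exploration over a discretized goal grid. The reward function and all constants are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy pouring reward: the best goal (here 0.3) is unknown
# to the learner and only observable through noisy outcomes.
def pouring_reward(goal):
    return -(goal - 0.3) ** 2 + 0.02 * rng.normal()

goals = np.linspace(-1.0, 1.0, 21)   # discretized candidate goal positions
values = np.full_like(goals, -1e3)   # pessimistic initial value estimates
counts = np.zeros_like(goals)
epsilon = 0.2                        # exploration rate

for trial in range(1000):
    if rng.random() < epsilon:
        i = int(rng.integers(len(goals)))     # explore a random goal
    else:
        i = int(np.argmax(values))            # exploit the best estimate
    r = pouring_reward(goals[i])
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]  # incremental running average

best_goal = goals[int(np.argmax(values))]
```

The point of the sketch is only that a goal can be selected from value estimates without a model of the liquid's dynamics; the paper's actual method combines such value-function-based goal learning with policy search over the DMP shape parameters.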