CST: Constructing Skill Trees by Demonstration

George Konidaris  gdk@csail.mit.edu
MIT CSAIL, 32 Vassar Street, Cambridge MA 02139 USA

Scott Kuindersma  scottk@cs.umass.edu
Roderic Grupen  grupen@cs.umass.edu
Andrew Barto  barto@cs.umass.edu
Computer Science Department, University of Massachusetts Amherst, Amherst MA 01003 USA

Appearing in Proceedings of the ICML Workshop on New Developments in Imitation Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).

Abstract

We describe recent work on CST, an online algorithm for constructing skill trees from demonstration trajectories. CST segments a demonstration trajectory into a chain of component skills, where each skill has a goal and is assigned a suitable abstraction from an abstraction library. These properties permit skills to be improved efficiently using a policy learning algorithm. Chains from multiple demonstration trajectories are merged into a skill tree. We describe applications of CST to acquiring skills from human demonstration in a dynamic continuous domain, and from both expert demonstration and learned control sequences on a mobile manipulator.

1. Introduction

Learning from demonstration (LfD) (Argall et al., 2009) offers a natural and intuitive approach to robot programming: rather than investing effort in writing a detailed control program, we simply show the robot how to achieve a task. LfD has received a great deal of attention in recent years because it aims to facilitate ubiquitous general-purpose automation by removing the need for engineering expertise, instead enabling the direct use of existing human procedural knowledge.

This paper summarizes recent work on CST, an LfD algorithm with four properties which, taken together, distinguish it from previous work. First, rather than converting a demonstration trajectory into a single controller, CST segments demonstration trajectories into a sequence of controllers (which we term skills, but
are also called behaviors or motion primitives). This aims to extract reusable components of the demonstrator's behavior. Second, CST extracts skills that have goals: in particular, the objective of skill n is to reach a configuration where skill n+1 can be successfully executed. Such skills can be refined by the robot using policy improvement algorithms. Third, CST optionally supports skill-specific abstraction selection, where each skill policy is defined using only a small number of relevant state and motor variables. This affords efficient representation and learning, facilitates transfer, and enables the acquisition of policies that are high-dimensional when represented monolithically but consist of subpolicies that can be individually represented using far fewer state variables. Finally, CST merges skill chains from multiple demonstrations into a skill tree, allowing it to deal with collections of trajectories that use different component skills to achieve the same goal, while also determining which trajectory segments are instances of the same policy.

2. Background

This work adopts the options framework, a hierarchical reinforcement learning formalism for learning and planning using temporally extended actions (or options), for modeling acquired skills.

An option, o, consists of three components: an option policy, π_o, giving the probability of executing each action in each state in which the option is defined; an initiation set indicator function, I_o, which is 1 for states where the option can be executed and 0 elsewhere; and a termination condition, β_o, giving the probability of option execution terminating in states where the option is defined. Given an option reward function (often just a cost function with a termination reward), determining the option's policy can be viewed as just another reinforcement learning problem, and an appro-
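The option components above, and the chaining property from Section 1 (skill n's goal is to reach the initiation set of skill n+1), can be sketched as a small data structure. This is an illustrative sketch only: the names `Option` and `chain_goal`, and the toy one-dimensional state, are assumptions of this example, not part of CST itself.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Toy types for illustration: a state is a tuple of floats, an action an int.
State = tuple
Action = int

@dataclass
class Option:
    """An option o = (pi_o, I_o, beta_o), as defined in the text."""
    policy: Callable[[State, Action], float]  # pi_o(a | s): action probability
    initiation: Callable[[State], bool]       # I_o(s): 1 where o can start
    termination: Callable[[State], float]     # beta_o(s): prob. of terminating

def chain_goal(skills: Sequence[Option], n: int) -> Callable[[State], bool]:
    """Goal test for skill n in a chain: reach a state where skill n+1
    can be successfully executed, i.e. a state in its initiation set."""
    return skills[n + 1].initiation

# Two hand-built skills on a 1-D toy state: skill 0 drives the state
# toward 5.0, where skill 1 becomes executable.
skill0 = Option(policy=lambda s, a: 1.0,
                initiation=lambda s: s[0] < 5.0,
                termination=lambda s: 1.0 if s[0] >= 5.0 else 0.0)
skill1 = Option(policy=lambda s, a: 1.0,
                initiation=lambda s: s[0] >= 5.0,
                termination=lambda s: 0.0)

goal_of_skill0 = chain_goal([skill0, skill1], 0)
```

Here `goal_of_skill0((6.0,))` holds while `goal_of_skill0((2.0,))` does not, reflecting that skill 0 succeeds exactly when it delivers the system into skill 1's initiation set.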