Autonomous Hierarchical Skill Acquisition in Factored MDPs

Christopher M. Vigorito and Andrew G. Barto
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01002
{vigorito,barto}@cs.umass.edu

Abstract— Learning hierarchies of reusable skills is essential for efficiently solving multiple tasks in a given domain. Understanding the causal relationships between one’s actions and various dimensions of one’s environment can facilitate learning of abstract skills that may be used subsequently in related tasks. Using Bayesian network structure-learning techniques and structured dynamic programming algorithms, we show that reinforcement learning agents can learn incrementally and autonomously both the causal structure of their environment and useful skills that exploit this structure. As new structure is discovered, more complex skills are learned, which in turn allow the agent to discover more structure, and so on. Because of this bootstrapping property, our approach can be considered a developmental process that results in steadily increasing domain knowledge and behavioral complexity.

I. INTRODUCTION

Much research in reinforcement learning (RL) has focused on efficient learning of optimal behavior policies for single sequential decision tasks in a given domain [1], [2]. The body of literature applying RL to ensembles of related tasks in the same or similar domains is considerably smaller. Part of the reason for this is the difficulty of defining relatedness between tasks. Without providing a strict definition, we adopt the notion that two or more tasks are related if the transition dynamics of their domains are either identical or overlap considerably in terms of their structure. In the former case, the tasks would only differ in their reward functions, while in the latter it is assumed that there are certain aspects of the dynamics that are common among the tasks.
These commonalities can often be exploited to learn policies for each task more efficiently than by learning each task from scratch [3].

An essential component of learning systems designed for solving ensembles of tasks efficiently is a mechanism for representational abstraction. That is, agents must be able to compactly represent policies and models of skills in order to feasibly learn a library of skills that can be reused in multiple tasks to solve similar sub-problems. If the representation for each skill is sparse in the sense of being defined only over relevant environmental variables, a skill can be applied in multiple contexts that differ along irrelevant dimensions without having to relearn the skill in each of those contexts. Abstract representations like this also greatly facilitate learning of such skills, since the number of relevant variables is generally much smaller than the total number of environmental variables, greatly reducing the amount of experience and computation needed to find good policies.

Hierarchy is also a necessity in such systems, allowing more abstract skills to make use of lower-level skills as atomic actions without concern for the details of their execution. This facilitates learning of complex skills as well as planning at multiple levels of abstraction. If an agent can construct a useful hierarchy of abstract skills in a given environment, then the search space of policies for similar tasks within that environment effectively shrinks. This is because selecting between alternative abstract actions allows the agent to take larger, more meaningful steps through the search space of policies than does selecting between more primitive actions [4].

In the framework presented here we focus on the model-based approach whereby an agent accumulates knowledge of the dynamical structure of its environment as it explores.
Using this structural knowledge, the agent incrementally generates abstract skills, each composed of a policy for reliably changing certain aspects of its environment and a compact model representing the long-term effects that skill has on the environment. As these skills are added to the agent’s skill set, they become available as primitive actions to be used when computing policies and models of more complex skills. This bootstrapping scenario of steadily increasing behavioral complexity built upon existing knowledge and behavioral repertoires can be considered an instance of an autonomous developmental learning system [5]. As an agent in this framework continues to add new skills to its behavioral repertoire, it becomes more of an expert at manipulating its environment. This increasing behavioral expertise is essentially the high-level goal of an agent in our approach.

The following section describes our formalism for this framework and presents relevant background material. In particular, we make the assumption that an agent’s environment can be modeled as a Markov Decision Process (MDP), more specifically a factored MDP, both of which are discussed below. We use incremental Bayesian network learning techniques [6] to accumulate structural knowledge of the environment and, given this knowledge, employ structured dynamic programming methods [2], [7] to compute abstract, closed-loop control policies and their corresponding models in the form of options, the formalization of skills we adopt.
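To make these ingredients concrete, the following is a minimal Python sketch (not taken from the paper) of the two data structures just described: an option in the standard formalization (an initiation set I, a policy π, and a termination condition β) and a DBN-style factored model that records, for each state variable, the parent variables its dynamics depend on. The variable names ("door", "key") and the specific edges are purely illustrative assumptions.

```python
# Illustrative sketch of the two structures described above. The option
# fields follow the standard (I, pi, beta) formalization; all concrete
# variable names and edges below are hypothetical examples.
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet

State = Dict[str, int]  # a factored state: variable name -> value

@dataclass
class Option:
    name: str
    initiation: Callable[[State], bool]    # I: states where the option may start
    policy: Callable[[State], str]         # pi: chooses an action (or sub-option)
    termination: Callable[[State], float]  # beta: probability of terminating

@dataclass
class FactoredModel:
    """DBN-style structure: each variable's dynamics depend on a small
    parent set, which can grow as new structure is discovered."""
    parents: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def add_edge(self, parent: str, child: str) -> None:
        # Record that `child`'s next value depends on `parent`'s current value.
        self.parents[child] = self.parents.get(child, frozenset()) | {parent}

# Hypothetical discovered structure: the door's next state depends on
# whether the key is held and on the door's own current state.
model = FactoredModel()
model.add_edge("key", "door")
model.add_edge("door", "door")
assert model.parents["door"] == frozenset({"key", "door"})

# A skill built on that structure: executable whenever the key is held,
# and terminating once the door is open.
open_door = Option(
    name="open-door",
    initiation=lambda s: s["key"] == 1,
    policy=lambda s: "use_key",
    termination=lambda s: 1.0 if s["door"] == 1 else 0.0,
)
assert open_door.initiation({"key": 1, "door": 0})
```

Note that because the model for "door" mentions only its parent variables, the open-door option can be reused in any state that differs only along variables outside that parent set, which is the representational sparseness argued for above.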