Efficient Abstraction Selection in Reinforcement Learning (Extended Abstract)

Harm van Seijen, Department of Computing Science, University of Alberta, Edmonton, Canada
Shimon Whiteson, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
Leon Kester, Distributed Sensor Systems Group, TNO Defence, Security and Safety, The Hague, The Netherlands

Abstract

This paper introduces a novel approach to abstraction selection in reinforcement learning problems modelled as factored Markov decision processes (MDPs), in which a state is described via a set of state components. In abstraction selection, an agent must choose an abstraction from a set of candidate abstractions, each built up from a different combination of state components.

1 Introduction

In reinforcement learning (RL) (Sutton and Barto 1998; Szepesvári 2010), an agent learns a control policy through interaction with an initially unknown environment, described via a set of states, while trying to optimize the (sum of) rewards resulting from its actions. An RL problem is typically modelled as a Markov decision process (MDP) (Bellman 1957).

One of the main obstacles to learning a good policy is the curse of dimensionality: the problem size grows exponentially with the number of problem parameters. Consequently, finding a good policy can require prohibitive amounts of memory, computation time, and/or sample experience (i.e., interactions with the environment). Fortunately, many real-world problems have internal structure that can be leveraged to dramatically speed learning.

A common structure in factored MDPs (Boutilier, Dearden, and Goldszmidt 1995), wherein each state is described by a set of state component values, is the existence of irrelevant (or near-irrelevant) state components, which affect neither the next state nor the reward. Removing such components can result in a dramatic decrease in the size of the state space.
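The effect of dropping an irrelevant component can be made concrete with a minimal sketch. The component names and sizes below are illustrative assumptions, not taken from the paper: a factored state is a tuple of component values, and an abstraction projects it onto a subset of the components.

```python
# Hypothetical factored state space; "noise" affects neither the next
# state nor the reward, so an abstraction can safely drop it.
components = {
    "position": range(10),   # relevant
    "fuel":     range(5),    # relevant
    "noise":    range(100),  # irrelevant
}

# Size of the full state space: product of all component sizes.
full_size = 1
for values in components.values():
    full_size *= len(values)

# Size of the abstract state space: only the relevant components remain.
relevant = ("position", "fuel")
abstract_size = 1
for name in relevant:
    abstract_size *= len(components[name])

def project(state, names=tuple(components), keep=relevant):
    """Project a full factored state onto the kept components."""
    return tuple(v for n, v in zip(names, state) if n in keep)

print(full_size, abstract_size)   # 5000 50
print(project((3, 2, 87)))        # (3, 2)
```

Here removing one irrelevant component shrinks the state space by a factor of 100, which is the kind of dramatic reduction the text refers to.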
Unfortunately, in an RL setting, where the environment dynamics are initially unknown, learning which components are irrelevant is a non-trivial task that typically requires a number of statistical tests that depends on the size of the full state space (see, for example, (McCallum 1995)). More recently, methods have emerged that focus on selecting the best abstraction, a subset of state components, from a set of candidate abstractions (Diuk, Li, and Leffler 2009; Konidaris and Barto 2009). The complexity of these methods depends only on the size of the abstractions used, which can be exponentially smaller than the full state space. The existing methods treat abstraction selection as an instance of model selection. Consequently, an abstraction is evaluated by measuring how well it predicts the outcome of an action, using some statistical measure.

In an RL setting, the model selection approach has a number of disadvantages. First, it does not take into account the on-line nature of RL, which requires the agent to balance exploration and exploitation. To balance these effectively, it is important to know which abstraction is currently the best, given the samples observed so far. For example, small, fast-learning abstractions might be preferred in the early learning phase, while larger, more informative abstractions might be preferred later on. This creates a fundamental conflict with model selection, which is based on the premise that there is a single best abstraction that needs to be found. Second, an abstraction selected on the basis of the accuracy of its predictions is not guaranteed to be the abstraction that yields the most reward; an abstraction can be way off in its predictions, but as long as it correctly identifies the best actions, it will result in high total reward.

Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
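The model-selection view described above can be sketched as follows. The scoring function is our own simplified stand-in (an empirical one-step log-likelihood); the cited methods use different statistical measures, but the principle is the same: an abstraction is ranked by how well it predicts observed outcomes, not by the reward it earns.

```python
import math
from collections import defaultdict

def prediction_score(abstraction, transitions):
    """Score an abstraction by the empirical log-likelihood of the
    observed next abstract states, given abstract state and action.

    transitions: list of (state, action, next_state) tuples.
    Higher score = more accurate one-step predictions.
    """
    # Count next-abstract-state frequencies per (abstract state, action).
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s2 in transitions:
        counts[(abstraction(s), a)][abstraction(s2)] += 1

    # Sum log-probabilities of the observed outcomes under the counts.
    score = 0.0
    for s, a, s2 in transitions:
        c = counts[(abstraction(s), a)]
        score += math.log(c[abstraction(s2)] / sum(c.values()))
    return score
```

For example, if the next value of a kept component actually depends on a dropped component, the coarser abstraction sees the same abstract state lead to different outcomes and receives a lower score than the full representation, even if both would induce the same greedy policy; this is exactly the mismatch with reward that the second disadvantage points out.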
We introduce a new, intuitive approach to abstraction selection that avoids the disadvantages of model selection. Our approach evaluates an abstraction by using it for action selection during a certain period of time and measuring the resulting rewards. To maintain accurate estimates for the different abstractions, the agent needs to frequently switch the abstraction it uses for action selection. A key insight behind our approach is that an agent that has to choose between abstractions faces an exploration-exploitation dilemma similar to the one it faces when choosing between its actions. Therefore, we formalize the task by introducing internal actions that allow the agent to switch between the different abstractions. The value of an internal action, which estimates the sum of future rewards, can be updated using regular RL methods. We call the derived task that includes the switch actions the abstraction-selection task. If the Markov property, which states that the outcome of an action depends only on the current state and not on the history, holds for this derived task, then the derived task is itself an MDP. In this case, convergence is guaranteed to the abstraction that is asymptotically the best, as well as to the optimal policy of that abstraction.
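The idea of internal switch actions valued by ordinary RL updates can be sketched as below. This is a rough illustration under our own assumptions (tabular Q-learning, an ε-greedy rule over both switch actions and primitive actions, and a TD-style update for the switch values); it is not the paper's exact algorithm.

```python
import random

class AbstractionSelector:
    """Sketch: one Q-table per candidate abstraction, plus values for the
    internal "switch" actions that decide which abstraction controls
    action selection. Hyperparameters are illustrative."""

    def __init__(self, abstractions, n_actions, alpha=0.1, gamma=0.95, eps=0.1):
        self.abstractions = abstractions            # list of projection functions
        self.n_actions = n_actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps
        self.q = [dict() for _ in abstractions]     # per-abstraction Q-tables
        self.switch_q = [0.0] * len(abstractions)   # values of the switch actions
        self.active = 0                             # currently selected abstraction

    def _qvals(self, i, s):
        return self.q[i].setdefault(s, [0.0] * self.n_actions)

    def select_action(self, state):
        # Internal action: epsilon-greedily choose which abstraction to use.
        if random.random() < self.eps:
            self.active = random.randrange(len(self.abstractions))
        else:
            self.active = max(range(len(self.abstractions)),
                              key=lambda i: self.switch_q[i])
        # Primitive action: epsilon-greedy under the active abstraction.
        s = self.abstractions[self.active](state)
        qs = self._qvals(self.active, s)
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: qs[a])

    def update(self, state, action, reward, next_state):
        i = self.active
        s = self.abstractions[i](state)
        s2 = self.abstractions[i](next_state)
        # Standard Q-learning update under the active abstraction.
        target = reward + self.gamma * max(self._qvals(i, s2))
        qs = self._qvals(i, s)
        qs[action] += self.alpha * (target - qs[action])
        # The switch action's value tracks the return obtained while this
        # abstraction was in control, via the same kind of TD update.
        self.switch_q[i] += self.alpha * (target - self.switch_q[i])
```

Because the switch values are learned with the same machinery as ordinary action values, an abstraction that learns fast early on is preferred early, and a more informative abstraction can take over once its estimates catch up, which is the behaviour the text argues model selection cannot provide.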