Q-Surfing: Exploring a World Model by Significance Values in Reinforcement Learning Tasks

Frank Kirchner 1,2 and Corinna Richter 1

Abstract. Reinforcement Learning addresses the problem of learning to select actions in unknown environments. Because Reinforcement Learning performs poorly in more complex and thus more realistic tasks with large state spaces and sparse reinforcement, much effort goes into speeding up learning as well as into finding structure in problem spaces [11, 12]. Models are introduced to improve learning by allowing the agent to plan on an internal world model, which makes directed exploration of the model a key factor for better learning results. In this paper we present an algorithm that explores the model by computing so-called Significance Values for each state. When these values are used for model planning, knowledge propagation is enhanced during early stages, while during later stages important states retain higher values and might therefore be useful for a future decomposition of the state space. Empirical results in a simple grid navigation task demonstrate this process.

1. INTRODUCTION

Reinforcement Learning (RL) addresses the question of learning how to pick actions in order to maximize an externally given payoff. RL techniques are applied to problems in which the learner is given the ability to perform actions in an environment and receives a sparse and/or delayed reward for these actions. Despite some initial and encouraging successes, where RL techniques could even be applied to moderately complex domains [10], they scale poorly to complex domains, whose state spaces become too large to be exhaustively explored. As most problem spaces studied in AI are too large and thus cause poor learning speed, the improvement of RL techniques is a central aspect of ongoing research.
The introduction of internal models [5, 6, 8] was a major innovation for making better use of gathered data and for improving learning performance. The basic idea of these approaches is the general planning procedure of trying possible alternatives on the model instead of trying them directly in the world. This is especially important for autonomous learning robots, where real-world steps are costly and potentially dangerous. The agent keeps an internal model of its knowledge, which is used for hypothetical action planning in order to improve real-world behaviour. How the world model is explored is the most important aspect of these methods and differs significantly among them.

In this paper we present a new algorithm which explores the world model on the basis of significant states. As knowledge is generally propagated in a wave-like process, and the basic idea of the algorithm is to stay on this wave of knowledge like a surfer, the method is called Q-Surfing [4]. Hypothetical steps are thus guaranteed to start within areas of already acquired knowledge and therefore improve the overall growth of knowledge. Over time, a few significant states in important world areas retain higher values, as they are sensitive with respect to exploration, and might therefore be useful in the future for state space decomposition.

1 GMD, Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
2 Northeastern University, Boston, MA, USA

2. REINFORCEMENT LEARNING

RL algorithms are active learning methods that enable a learning agent to select actions so as to maximize an externally given performance measure. The performance measure, the so-called reward or penalty, is usually sparse and delayed. At the beginning of the learning process, when the agent has not yet received any payoff, the effects of applying actions are unknown to the learner.
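The interaction pattern just described (an agent acting in an unknown environment under a sparse, delayed payoff) can be sketched in a few lines. The one-dimensional corridor environment and its reward placement are illustrative assumptions, not taken from the paper:

```python
import random

class Corridor:
    """A one-dimensional corridor; the reward is sparse: only the
    rightmost state pays off (an illustrative assumption)."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.state = max(0, min(self.length - 1, self.state + action))
        reward = 1.0 if self.state == self.length - 1 else 0.0
        return self.state, reward

# Before any payoff has been received, the learner has no basis for
# preferring one action over another, so it can only act blindly.
env = Corridor()
total_reward = 0.0
for _ in range(20):
    action = random.choice((-1, +1))
    state, reward = env.step(action)
    total_reward += reward
```

Until the first reward arrives, every action looks equally good; this is exactly the situation in which the learned value-functions of the next paragraphs become useful.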
The goal of learning is to find an optimal policy for the selection of actions which, when applied, maximizes the future reward. Most RL techniques are based on the estimation of value-functions that assess the utility of each individual state of the environment. The value-function represents the estimated future reward the learner receives upon executing the best available action in the corresponding state. Once a good or optimal value-function is known, it may be used to generate good or optimal actions by a fast, shallow search in the action space. Hence the problem of learning an optimal action selection policy is reduced to learning optimal value-functions. In most currently available RL methods, the optimal value-function is approximated with dynamic programming techniques [2].
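As a concrete illustration of this approach, the following sketch runs tabular value iteration, a standard dynamic programming technique, on a tiny deterministic chain task and then derives actions by the one-step lookahead mentioned above. The four-state chain, its reward, and the discount factor are illustrative assumptions, not the paper's task:

```python
GAMMA = 0.9        # discount factor (an illustrative assumption)
N_STATES = 4       # states 0..3; state 3 is the rewarding terminal state
ACTIONS = (-1, +1) # move left / move right

def step(state, action):
    """Deterministic transition; entering state 3 yields reward 1."""
    nxt = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def value_iteration(n_sweeps=50):
    """Repeatedly back up each state's value from its best successor."""
    V = [0.0] * N_STATES
    for _ in range(n_sweeps):
        for s in range(N_STATES - 1):   # the terminal state keeps V = 0
            backups = []
            for a in ACTIONS:
                nxt, r = step(s, a)
                backups.append(r + GAMMA * V[nxt])
            V[s] = max(backups)
    return V

def greedy_action(V, s):
    """The 'fast, shallow search': one-step lookahead over the actions."""
    best_a, best_q = None, float("-inf")
    for a in ACTIONS:
        nxt, r = step(s, a)
        q = r + GAMMA * V[nxt]
        if q > best_q:
            best_a, best_q = a, q
    return best_a

V = value_iteration()
policy = [greedy_action(V, s) for s in range(N_STATES - 1)]
```

Here the converged values decay geometrically with the distance to the reward (V = [0.81, 0.9, 1.0] for the non-terminal states), and the greedy one-step lookahead recovers the optimal "always move right" policy without any further search.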