2475-1502 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TG.2018.2806007, IEEE Transactions on Games

IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. ?, NO. ?, ? 2017

Exploration in Continuous Control Tasks via Continually Parameterised Skills

Michael Dann, Fabio Zambetta, John Thangarajah
School of Science, RMIT University, Australia

Abstract—Applications of reinforcement learning to continuous control tasks often rely on a steady, informative reward signal. In videogames, however, tasks may be far easier to specify through a binary reward that indicates success or failure. In the absence of a steady, guiding reward, the agent may struggle to explore efficiently, particularly if effective exploration requires strong coordination between actions. In this paper, we show empirically that this issue may be mitigated by exploring over an abstract action set, using hierarchically composed parameterised skills. We experiment on two tasks with sparse rewards in a continuous control environment based on the arcade game Asteroids. Compared to a flat learner that explores symmetrically over low-level actions, our agent explores a greater variety of useful actions, and its long-term performance on both tasks is superior.

Index Terms—Reinforcement Learning, Parameterised Skills, Computer Games

I. INTRODUCTION

Reinforcement learning methods for continuous control tasks have improved significantly in recent times [1], [2], [3]. Nonetheless, one issue that remains problematic in continuous domains is how to explore; that is, how best to deviate from the current strategy in order to enable the discovery of superior strategies.

In practice, this issue is often mitigated by specifying the task through an informative reward signal, i.e. one that strongly shapes the agent's behaviour by providing evaluative feedback after every time step. However, some tasks are far simpler to specify via a binary reward for success or failure. Such tasks are common in videogames. For example, in a game involving obstacle avoidance, it is far easier to define a negative reward for colliding with an obstacle than to prescribe the exact path that the agent should follow.

Unfortunately, in the absence of informative rewards, the agent may struggle to learn anything through naïve exploration. Consider, for example, a humanoid robot attempting a rollerskating task. Suppose that the aim is to navigate to a target 10 metres away, and that there is a single positive reward upon success. If the robot's exploration strategy is simply to flex its joints in a random, uncoordinated manner, it will struggle to move away from the start location and will thus rarely, if ever, receive any positive feedback.

A natural way to address this issue is through hierarchical learning [4], [5], [6]. For example, suppose that the rollerskating agent is first taught a parameterised skill [7], [8], [9] for skating in any chosen direction at any possible speed. If it then explores by randomly varying its velocity, as opposed to randomly flexing its joints, it will reach a wider range of locations and be more likely to discover the reward.

This usage of skills differs subtly from their traditional usage in facilitating knowledge transfer and temporal abstraction. To give an example of traditional use, da Silva et al. describe a robot that is tasked with moving objects around a warehouse [9].
If the robot is equipped with a parameterised "pick up object" skill that can handle objects of different shapes and sizes, it will not have to be retrained when it encounters new objects in the future. Furthermore, it can decompose long-term tasks into sequences of subtasks (e.g. pick up object → move object → put down object). Note that the skill parameters in this example (the size and shape of the object) are fixed for the duration of each subtask. By contrast, in our rollerskating example, the agent's target velocity is continually updated. The "skate at specified velocity" skill does not introduce temporal abstraction. Instead, the agent still acts at the most granular time scale, but it learns over an abstract action space where it need not be concerned with low-level control.

To the best of our knowledge, there is no previous experimental work that compares hierarchical learning via continually parameterised skills against ordinary "flat" learning. Masson et al. envisaged our type of approach, and trained continuous policies over parameterised "shoot" and "dribble" actions in a soccer environment [10]. However, in their work there was no primitive, underlying action space against which to compare learning, as their focus differed from ours.¹ Accordingly, the contribution of this paper is best described as a novel application of existing ideas.

To clarify our approach and show how it may be practically applied, we present a case study in an environment based on the classic arcade game, Asteroids.² Within this environment there are two tasks defined by sparse rewards: a goal-seek task, where the only feedback provided is a +1 reward for reaching a goal zone, and a keep-alive task, where the only non-zero reward is a -1 penalty for colliding with an asteroid. The agent must coordinate two continuous thrusters to steer the ship while contending with inertia.
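The contrast between exploring over low-level actions and exploring over a continually parameterised skill can be sketched in a toy simulation. The snippet below is purely illustrative and is not the learned skill or environment used in our experiments: velocity_skill is a hand-coded, clamped proportional controller standing in for a learned "move at specified velocity" skill, and the dynamics are a simple 2-D point mass with drag. It shows why perturbing the abstract action (a target velocity) produces coordinated low-level thrusts, whereas uncorrelated low-level noise tends to leave the ship near its start.

```python
import random

def velocity_skill(vel, target, gain=0.5, max_thrust=0.1):
    """Hand-coded stand-in for a learned parameterised skill: map the
    abstract action (a target velocity) to low-level thruster commands
    via a clamped proportional controller."""
    ax = max(-max_thrust, min(max_thrust, gain * (target[0] - vel[0])))
    ay = max(-max_thrust, min(max_thrust, gain * (target[1] - vel[1])))
    return ax, ay

def simulate(policy, steps=200, drag=0.9, seed=0):
    """Integrate simple 2-D point-mass dynamics with drag and return
    the final distance from the start position."""
    rng = random.Random(seed)
    x = y = vx = vy = 0.0
    for _ in range(steps):
        ax, ay = policy((vx, vy), rng)
        vx = drag * (vx + ax)
        vy = drag * (vy + ay)
        x += vx
        y += vy
    return (x * x + y * y) ** 0.5

def flat_explorer(vel, rng):
    # Uncorrelated random thrusts: successive low-level actions tend to
    # cancel out, so the ship performs a random walk near the start.
    return rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1)

def skill_explorer(vel, rng, target=(0.5, 0.0)):
    # A single exploratory high-level action (a sampled target velocity,
    # here fixed for clarity) is turned into coordinated thrusts by the
    # skill, which is re-invoked at every time step.
    return velocity_skill(vel, target)

# In this toy setting the skill-based explorer ends up far from the
# start, making it more likely to stumble upon a sparse reward.
print(simulate(skill_explorer) > simulate(flat_explorer))
```

Note that the skill is queried at every time step with the current velocity, mirroring the continual parameterisation described above: there is no temporal abstraction, only an abstraction of the action space.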
We compare a hierarchical agent that is equipped with a parameterised skill for controlling the ship's velocity against a flat learner, i.e. an agent that does not exploit hierarchy in the action space and instead controls the thrusters directly. In this setting, we show that the hierarchical agent tends to explore actions with greater relevance to the task than those chosen by the flat agent. As a result, its long-term

¹ Masson et al. focused on determining when to switch between different types of parameterised skill, i.e. when to stop dribbling and take a shot.
² https://en.wikipedia.org/wiki/Asteroids_(video_game)