Acquiring Diverse Predictive Knowledge in Real Time by Temporal-difference Learning

Joseph Modayil and Adam White and Patrick M. Pilarski and Richard S. Sutton 1

Abstract. Existing robot algorithms demonstrate several capabilities that are enabled by a robot's knowledge of the temporally-extended consequences of its behaviour. This knowledge consists of real-time predictions: predictions that are conventionally computed by iterating a small one-timestep model of the robot's dynamics. Given the utility of such predictions, alternatives are desirable when this conventional approach is not applicable, for example when an adequate model of the one-timestep dynamics is either not available or not computationally tractable. We describe how a robot can both learn and make many such predictions in real time using a standard reinforcement learning algorithm. Our experiments show that a mobile robot can learn and make thousands of accurate predictions at 10 Hz about the future of all of its sensors and many internal state variables at multiple time-scales. The method uses a single set of features and learning parameters that are shared across all the predictions. We demonstrate the generality of these predictions with an application to a different platform, a robot arm operating at 50 Hz. Here, the predictions are about which arm joint the user wants to move next, a situation that is difficult to model analytically, and we show how the learned predictions enable measurable improvements to the user interface. The predictions learned in real time by this method constitute a basic form of knowledge about the robot's interaction with the environment, and extensions of this method can express more general forms of knowledge.

1 Introduction

A robot's ability to make real-time predictions about the consequences of its behaviour supports several additional capabilities.
Examples of robot capabilities built on real-time predictions include collision avoidance [Fox et al., 1997], stability [Abbeel et al., 2010], and motion planning [LaValle, 2006]. The conventional approach to making these predictions is to manually construct a small one-timestep model of the system dynamics offline, and then, during real-time operation, to make temporally-extended predictions by simulating future trajectories with the model. However, this approach requires a one-timestep model of the dynamics to be available, and it requires computationally expensive simulations with the model to predict quantities of interest.

We propose an alternative approach for real-time predictions, namely to learn to directly predict the temporally-extended consequences of a behaviour in real time. This is the technique used for the critic's value function in an actor-critic based method. We demonstrate that this direct approach scales well for learning and making many temporally-extended predictions in parallel, and thus potentially opens the door to new robot capabilities.

The main contribution of this work is an empirical demonstration that thousands of temporally-extended predictions can be learned online in real time with high accuracy. The predictions take the form of questions about future sensor values and internal state bits. We demonstrate that a mobile robot can both learn and make thousands of predictions in real time. In our first experimental setting, predictions are made every 100 ms, and the predictions are about the robot's future sensor readings and internal state variables either at the next timestep in 100 ms, or over the next short time scale of 0.5, 2, or 8 seconds.

1 Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Canada. email: {jmodayil, awhite, pilarski, sutton}@cs.ualberta.ca
These predictions provide the robot with immediate knowledge about many distinct, temporally-extended consequences of its behaviour. In a second experimental setting, we demonstrate the generality of these predictions by evaluating how they can improve the user interface for a robot arm.

The approach is novel in several respects. The predictions have the benefit of scientific empiricism: the predictions can be evaluated for their accuracy by comparison with the robot's future experience. Although directly learning the temporally-extended consequences of behaviour is not a common way of representing knowledge in robotics, these predictions can also be assembled to form a conventional one-timestep model of the dynamics. The ease of acquiring this knowledge, the generality of the method, and known extensions to the prediction algorithm suggest that this is a promising direction for further investigation.

The paper is structured as follows. First, we present the method and describe the learning setting precisely. Then, we show results from our experimental evaluation of the method on a mobile robot. We then demonstrate the generality of this method with an application to the completely different domain of predictions for a human-guided robot arm. After describing related work, we discuss how this method can be extended to more general forms of predictions.

2 Method

The method relies on learning many temporally-extended predictions, so we first review the underlying temporal-difference prediction algorithm TD(λ) [Sutton, 1988]. As input at each timestep t ∈ N, the algorithm receives the feature vector x_t ∈ R^n. The feature vector is the robot's description of the state of the environment s_t. Note that the description provided by x_t will be restricted to features that the robot can readily compute, and this is typically an incomplete characterization of the state of the environment.
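To make the inputs just described concrete, the sketch below implements the standard linear TD(λ) update (Sutton, 1988) for a single prediction, where the prediction for a feature vector x is the inner product of x with a learned weight vector. The step size, discount, trace-decay value, and the toy feature stream are illustrative assumptions for this sketch, not the settings used in our experiments.

```python
def make_td_lambda(n, alpha=0.1, gamma=0.9, lam=0.7):
    """Linear TD(lambda) learner for one prediction.

    n: number of features. alpha (step size), gamma (discount), and
    lam (trace decay) are illustrative values for this sketch.
    """
    w = [0.0] * n  # learned weight vector; prediction is dot(w, x)
    e = [0.0] * n  # accumulating eligibility trace

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    def step(x, r_next, x_next):
        # TD error: target signal plus discounted next prediction,
        # minus the current prediction.
        delta = r_next + gamma * dot(w, x_next) - dot(w, x)
        for i in range(n):
            # Decay the trace and add the currently active features,
            # then move the weights along the trace.
            e[i] = gamma * lam * e[i] + x[i]
            w[i] += alpha * delta * e[i]
        return dot(w, x_next)  # updated prediction for the new state

    return step

# Toy usage: a stationary binary feature vector and a constant
# target signal of 1, so the prediction approaches 1/(1-gamma) = 10.
step = make_td_lambda(n=4)
x = [1.0, 0.0, 0.0, 1.0]
for _ in range(300):
    pred = step(x, 1.0, x)
```

In the full system, one such learner (sharing the feature vector and learning parameters) runs per predictive question, with the target signal and time scale varying across questions.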
Each predictive question pertains to some signal r_t ∈ R that is observed at each timestep. The signal is called the reward in reinforcement learning, but here it is an arbitrary target signal and does not indicate