Relational Temporal Difference Learning

Nima Asgharbeygi    nimaa@stanford.edu
David Stracuzzi    stracudj@csli.stanford.edu
Pat Langley    langley@csli.stanford.edu

Center for the Study of Language and Information, Computational Learning Laboratory, Stanford University, Stanford, CA 94305 USA

Abstract

We introduce relational temporal difference learning as an effective approach to solving multi-agent Markov decision problems with large state spaces. Our algorithm uses temporal difference reinforcement to learn a distributed value function represented over a conceptual hierarchy of relational predicates. We present experiments using two domains from the General Game Playing repository, in which we observe that our system achieves higher learning rates than non-relational methods. We also discuss related work and directions for future research.

1. Background and Motivation

Most research in AI views intelligent behavior as search through a problem space to achieve goals. Directing that search is crucial to an agent's success, but crafting search-control heuristics manually is difficult and prone to error. An alternative response is to acquire such heuristic knowledge through learning. One common approach formulates this task as learning control policies from delayed reward, with policies encoded by expected value functions over Markov decision processes (Sutton & Barto, 1998). This general approach to reinforcement learning has been studied in many settings and from many perspectives.

Most work in this tradition uses limited representations and downplays the role of background knowledge. As a result, typical systems search a very large state space and thus learn far more slowly than do humans placed in similar situations. Research on temporal abstraction (Dietterich, 2000) and state abstraction (Asadi & Huber, 2004) aims to increase learning rates, but few efforts have utilized the more powerful relational representations that are standard in other AI subfields. Recent work on relational reinforcement learning (Dzeroski et al., 2001) uses first-order representations to provide effective abstraction, but it does not take advantage of action models, which are an important source of knowledge in many domains.

In this paper, we report a new approach to learning from delayed reward in multi-player games. Our framework is similar to relational reinforcement learning in its reliance on first-order representations. However, it employs a variant of temporal differencing, which is more appropriate than Q-learning when an action model is available, as Tesauro (1994) and Baxter et al. (1998) have demonstrated.

As in Dzeroski et al.'s work, we use a relational representation to support effective generalization across states, which should produce more rapid learning. However, rather than using relational regression trees to encode expected values, we use a factored representation that associates component values with relational predicates. These are combined into an overall score, much as in traditional state evaluation functions. Our work offers a novel approach to combining ideas from relational reinforcement learning and feature-based temporal difference learning.
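To make the general idea concrete, the following is a minimal sketch, not the authors' actual system, of a factored value function in which each relational predicate carries a learned component value, the state value is their sum (as in a linear evaluation function), weights are adjusted by a TD(0)-style update, and moves are chosen by one-step lookahead through an action model rather than by learning Q-values. The helpers `active_predicates`, `successor`, and `legal_moves` are hypothetical placeholders for domain-supplied functions.

```python
from collections import defaultdict


class FactoredTDValue:
    """Sketch of a factored, predicate-based value function with TD updates."""

    def __init__(self, alpha=0.1, gamma=1.0):
        self.weights = defaultdict(float)  # one value component per relational predicate
        self.alpha = alpha                 # learning rate
        self.gamma = gamma                 # discount factor

    def value(self, state, active_predicates):
        # Overall score: sum of the components for predicates that hold in `state`.
        return sum(self.weights[p] for p in active_predicates(state))

    def td_update(self, state, next_state, reward, active_predicates):
        # TD(0) target from the successor state; the error is shared among
        # the predicates active in the current state.
        target = reward + self.gamma * self.value(next_state, active_predicates)
        error = target - self.value(state, active_predicates)
        for p in active_predicates(state):
            self.weights[p] += self.alpha * error

    def greedy_move(self, state, legal_moves, successor, active_predicates):
        # With an action model available, act by one-step lookahead on the
        # learned state values instead of learning Q(s, a) directly.
        return max(legal_moves(state),
                   key=lambda m: self.value(successor(state, m), active_predicates))
```

The design choice this illustrates is that tying value components to relational predicates lets experience in one state update the evaluation of every state in which the same predicates hold, which is the source of the hoped-for generalization and faster learning.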
In the next section, we describe our representation of states, moves, and expected values, the performance system that uses this knowledge to play games, and our method for relational temporal difference learning. We then present experiments designed to demonstrate the advantages of this approach. This includes discussion of the general game playing domain and the specific games on which we evaluate our method. We conclude by discussing related work and outlining directions for future research on relational learning from delayed reward.