Temporal Difference Learning with Interpolated N-Tuple Networks: Initial Results on Pole Balancing

Aisha A. Abdullahi¹ and Simon M. Lucas¹

Abstract— Temporal difference learning (TDL) is perhaps the most widely used reinforcement learning method and gives competitive results on a range of problems, especially when using linear or table-based function approximators. However, it has been shown to give poor results on some continuous control problems, and an important question is how it can be applied to such problems more effectively. The crucial point is how TDL can be generalized and scaled to complex, high-dimensional problems without suffering from the curse of dimensionality. We introduce a new function approximation architecture, the Interpolated N-Tuple network, and perform a proof-of-concept test on the classic reinforcement learning problem of pole balancing. The results show the method to be highly effective on this problem. They offer an important counter-example to some recently reported results in which neuro-evolution outperformed TDL: TDL with Interpolated N-Tuple networks learns to balance the pole considerably faster than the leading neuro-evolution techniques.

I. INTRODUCTION

Interaction with the environment is the most natural way for a human being to learn, and hence a key question in machine learning is how best to replicate this process algorithmically. Studying this interaction yields a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals; reinforcement learning (RL) addresses exactly this. The agent and the environment interact continually: in state s_t the agent selects action a_t, and the environment responds to that action by presenting a new state s_{t+1} to the agent, see Fig. 1. The environment also gives rise to rewards r_{t+1}, special numerical values that the agent tries to maximize over time. Temporal Difference Learning (TDL) is one of the most popular RL methods and stores all of its experience in a value function [8].
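The interaction loop described above can be sketched in a few lines. The `ToyChainEnv` environment, its dynamics, and all names below are illustrative assumptions for this sketch, not the task or software used in the paper.

```python
class ToyChainEnv:
    """Minimal stand-in environment (hypothetical, not the paper's cart-pole):
    states 0..4 on a chain; action +1/-1 moves along it; reward 1.0 is given
    on reaching state 4, which ends the episode."""

    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, min(4, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done


def run_episode(env, policy, max_steps=100):
    """Generic RL interaction loop: at each step t the agent observes s_t,
    selects a_t, and the environment returns r_{t+1} and s_{t+1}."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)                   # agent selects a_t given s_t
        s_next, r, done = env.step(a)   # environment yields r_{t+1}, s_{t+1}
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```

For example, the always-move-right policy `lambda s: 1` collects the terminal reward, while the always-move-left policy never does.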
By following the temporal difference between the current state and its successor, the agent seeks to maximize the reward signal over time; the value function is updated as follows:

V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]

where V(s) is the value of state s, α is the learning rate, and γ is the discount factor, which determines the present value of future rewards [8].

¹ A. A. Abdullahi and S. M. Lucas are with the School of Computer Science and Electronic Engineering, University of Essex, Colchester, United Kingdom (email: {aamabd, sml}@essex.ac.uk).

In tasks with small, finite state sets, it is possible to represent the value function using arrays or tables with one entry for each state (or state-action pair) [8]. This is called the tabular case, and the corresponding methods are called tabular methods. However, in real-world problems the state space is typically very large or continuous, which makes a direct tabular approach infeasible. The only way to learn anything at all is to generalize from previously experienced states to states that have never been seen, a process widely known as function approximation. The simplest approach is to limit the problem space to a discrete number of states and map the problem states onto them. There are two extremes to this approach. A coarse discretisation can give a very poor representation of the problem, with significantly different states being treated identically. The intuitive alternative is a much finer discretisation that differentiates between all states; the downside is not just the large amount of memory needed but also the time and data needed to visit and train all possible system states. The exponential growth of table size with the number of input dimensions is known as the curse of dimensionality. The goal of this paper is to examine a new function approximation technique that aims to alleviate these issues.
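The update rule and the tabular representation can be combined into a minimal sketch, assuming an integer-indexed state space and illustrative values for α and γ (the paper's own parameter settings are not stated here):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular TD(0) step:
    V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# One table entry per state works for small, finite state sets, but a
# continuous d-dimensional input discretised into k bins per dimension
# needs k**d entries: the curse of dimensionality.
V = [0.0] * 5                          # states 0..4, initialised to zero
V = td0_update(V, s=3, r=1.0, s_next=4)
```

Starting from an all-zero table, a transition from state 3 to state 4 with reward 1.0 moves V[3] from 0 to α × 1.0 = 0.1, while all other entries remain untouched.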
The main idea is to use the recently developed interpolated table value function [6] for smooth function approximation, and to configure such tables as N-Tuple networks to overcome the curse of dimensionality. The pole balancing problem, with its rich history across a wide range of RL methods, was used as an appropriate test-bed. The rest of the paper is organized as follows. First the pole-balancing problem is defined, together with some background on previously applied methods. Then there is a section for each ...

Fig. 1: The agent's interaction with the environment.
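Reference [6] is not reproduced here, but the broad idea of small per-tuple tables read by interpolation and summed across tuples can be hedged as a sketch. The class names, the restriction to 2-tuples, the bilinear interpolation, and the assumption of inputs scaled to [0, 1] are all illustrative choices, not the paper's exact architecture.

```python
class Tuple2D:
    """One illustrative 2-tuple: it watches two state dimensions (i, j)
    and owns a small bins x bins table read by bilinear interpolation,
    so nearby states receive smoothly varying values."""

    def __init__(self, i, j, bins):
        self.i, self.j, self.bins = i, j, bins
        self.table = [[0.0] * bins for _ in range(bins)]

    def value(self, state):
        # Map the two watched inputs (assumed in [0, 1]) onto table coords.
        x = state[self.i] * (self.bins - 1)
        y = state[self.j] * (self.bins - 1)
        x0 = min(int(x), self.bins - 2)
        y0 = min(int(y), self.bins - 2)
        fx, fy = x - x0, y - y0
        t = self.table
        # Blend the four surrounding table entries (bilinear interpolation).
        return ((1 - fx) * (1 - fy) * t[x0][y0]
                + fx * (1 - fy) * t[x0 + 1][y0]
                + (1 - fx) * fy * t[x0][y0 + 1]
                + fx * fy * t[x0 + 1][y0 + 1])


def network_value(tuples, state):
    """N-tuple network output: the sum of each tuple's interpolated read.
    Memory grows as (number of tuples) * bins**2 here, rather than the
    bins**d of one monolithic table over all d input dimensions."""
    return sum(tp.value(state) for tp in tuples)
```

For instance, with a single 3x3 tuple whose entry table[1][1] is set to 4.0, the state (0.25, 0.25) falls exactly between four entries and reads back 0.25 × 4.0 = 1.0, illustrating the smooth blending a plain lookup table lacks.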