Learning near-optimal policies with fitted policy iteration and a single sample path

András Antos and Csaba Szepesvári
Computer and Automation Research Inst. of the Hungarian Academy of Sciences
Kende u. 13-17, Budapest 1111, Hungary
{antos,szcsaba}@sztaki.hu

Rémi Munos
Centre de Mathématiques Appliquées
École Polytechnique
91128 Palaiseau Cedex, France
remi.munos@polytechnique.fr

Abstract

In this paper we consider the problem of learning a near-optimal policy in continuous-space, expected total discounted-reward Markovian Decision Problems using approximate policy iteration. We consider batch learning, where the training data consists of a single sample path of a fixed, known, persistently-exciting stationary stochastic policy. We derive PAC-style bounds on the difference between the performance of the policy returned by the algorithm and the optimal value function, in both L∞- and weighted L2-norms.

1 Introduction

Reinforcement learning (RL) deals with the problem of how to choose actions so as to maximize some long-term performance index [14]. Here we assume that the environment of the decision maker can be described by a Markovian Decision Problem (MDP). In an MDP the evolution of states is controlled by selecting an action in each time step. State transitions are stochastic. Further, in each time step the agent controlling the MDP receives a random reward whose distribution depends on the state just visited and the action last executed. In this paper we shall be concerned with MDPs where the long-term performance index is given by the expected total discounted reward. We study batch-learning problems where the MDP is unknown, but a sufficiently rich sample path (the execution trace of a fixed, known stochastic stationary policy) is available at the beginning of learning. The optimal action-value (or, in short, Q-) function underlying an MDP maps state-action pairs to reals.
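The verbal definition that follows can be written compactly. A standard formulation is sketched below; the notation (discount factor γ, reward R_t, optimal policy π*) is assumed here and is not given explicitly in this excerpt:

```latex
Q^*(s,a) \;=\; \mathbb{E}\Big[\,\sum_{t=0}^{\infty} \gamma^t R_t \;\Big|\; S_0 = s,\; A_0 = a,\; A_t \text{ chosen optimally for } t \ge 1 \Big],
\qquad \gamma \in [0,1).
```

A greedy policy with respect to Q*, i.e., one with π*(s) ∈ argmax_a Q*(s,a), is then optimal.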
For a given state and action, it gives the expected total discounted reward given that the process is started at the specified state, the action selected in the first time step equals the specified action, and in all subsequent steps optimal actions are chosen. The optimal action-value function plays a crucial role in MDPs: knowing it suffices to construct an optimal policy, i.e., one that achieves the largest possible expected total discounted reward for any start state. Classical value-function-based methods (more precisely, their variants that construct the optimal action-value function), such as value iteration and policy iteration, iteratively approximate the optimal action-value function. Under mild conditions, due to the presence of the discount factor, the iterates converge at a geometric rate to the optimal action-value function (see e.g. [3]). These methods assume that the state-action pairs can be enumerated and hence action-value functions can be represented as vectors of
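The geometric-rate convergence mentioned above can be observed directly in the tabular setting. A minimal sketch of value iteration on a toy two-state, two-action MDP (the transition probabilities and rewards below are invented for illustration and are not from the paper):

```python
import numpy as np

# Toy MDP, purely illustrative: 2 states, 2 actions.
# P[a, s, s'] = transition probability, R[a, s] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.5, 0.5], [0.4, 0.6]]])  # action 1
R = np.array([[1.0, 0.0],                 # action 0
              [0.5, 2.0]])                # action 1
gamma = 0.9  # discount factor

Q = np.zeros((2, 2))  # Q[a, s], initialized to zero
errors = []
for _ in range(60):
    V = Q.max(axis=0)             # greedy state values V(s) = max_a Q(a, s)
    Q_next = R + gamma * (P @ V)  # Bellman optimality backup
    errors.append(np.abs(Q_next - Q).max())
    Q = Q_next

# The Bellman operator is a gamma-contraction in sup-norm, so successive
# differences shrink by at least the factor gamma per iteration.
ratios = [errors[i + 1] / errors[i] for i in range(5, 15)]
print("optimal state values:", Q.max(axis=0))
print("per-step contraction ratios:", ratios)
```

Each contraction ratio printed is bounded by gamma = 0.9, which is the geometric rate referred to in the text; after 60 iterations the remaining change is on the order of 0.9^60 times the initial error.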