Near Optimal On-Policy Control

Matthew Robards, Peter Sunehag
Australian National University
NICTA

Abstract. We introduce two online gradient-based reinforcement learning algorithms with function approximation, one model-based and the other model-free, for which we provide a regret analysis. Our regret analysis has the benefit that, unlike many other analyses of gradient-based algorithms for reinforcement learning with function approximation, it makes no probabilistic assumptions, so we need not assume a fixed behavior policy.

1 Introduction and Background

The ability to learn online is an important trait for reinforcement learning (RL) algorithms. Recently, there has been significant focus on using stochastic gradient descent to enable online reinforcement learning [1], [5], [8], [9], with significant theoretical advances.

We will here introduce two new algorithms for reinforcement learning with function approximation: one can be understood as model-based reinforcement learning, the other as model-free. The methods in [1], [5], [8], [9] are all model-free, and they are shown to converge under the assumption that the next state and reward are drawn from a steady-state distribution given the current state. This requires the agent to follow a fixed behavior policy. We wish to give theoretical analyses of our algorithms without placing restrictions on the behavior policy, and hence we turn to an analysis that makes no probabilistic assumptions: regret bounds.

1.1 Related Theoretical Analyses

The classical methods SARSA(λ) and Q-learning were introduced [7] in the tabular reinforcement learning setting and were heuristically extended to linear function approximation. These methods, however, are known to have convergence issues in this more general setting. Such problems have recently been addressed by a series of gradient descent methods proposed for temporal difference learning and optimal control [5], [8], [9].
Unlike the present work, however, these methods only come with guarantees in the very restricted case of a fixed behavior policy.

This author was supported by ARC grant DP0988049.
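To make the setting concrete, the gradient-style temporal-difference methods discussed above share a common core: an incremental (semi-gradient) update of a linear value estimate V(s) = w·φ(s). The following is a minimal illustrative sketch of such an online TD(0) update, not the algorithms of this paper; the feature map, step size, and the toy two-state chain are our own assumptions for demonstration.

```python
import numpy as np

def td0_semi_gradient_update(w, phi_s, phi_next, reward, gamma=0.99, alpha=0.1):
    """One online semi-gradient TD(0) step for a linear value estimate V(s) = w . phi(s).

    The TD error bootstraps on the current estimate of the next state's value;
    only phi_s (not phi_next) appears in the gradient term, hence "semi-gradient".
    """
    td_error = reward + gamma * np.dot(w, phi_next) - np.dot(w, phi_s)
    return w + alpha * td_error * phi_s

# Toy two-state chain: state 0 transitions to state 1 with reward 1,
# then the episode terminates (zero-feature "terminal" successor, reward 0).
w = np.zeros(2)
phi = np.eye(2)  # one-hot features make this equivalent to a tabular update
for _ in range(200):
    w = td0_semi_gradient_update(w, phi[0], phi[1], reward=1.0)
    w = td0_semi_gradient_update(w, phi[1], np.zeros(2), reward=0.0)
```

With these one-hot features the estimate converges toward V(0) ≈ 1 and V(1) ≈ 0. The convergence issues mentioned above arise precisely when such updates are combined with general (non-tabular) features and off-policy sampling.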