Approximate dynamic programming
Optimal decisions, Part 9
Christos Dimitrakakis
November 14, 2012

1 Introduction

In this chapter, we consider approximate dynamic programming. This includes all methods with approximations in the maximisation step, methods where the value function used is approximate, and methods where the policy used is some approximation to the optimal policy.

We first consider the case where we have an approximate value function. Let $u \in V$ be an approximate optimal value function obtained via some arbitrary method. Then we can define the greedy policy with respect to it as follows:

Definition 1 ($u$-greedy policy and value function).
$$\pi_u \in \arg\max_{\pi} \mathscr{L}_{\pi} u, \qquad v_u = \mathscr{L} u, \tag{1.1}$$
where $\pi : S \to D(A)$ maps from states to action distributions.

Although previously policies did not need to be stochastic, here we explicitly consider stochastic policies to facilitate the approximations. Nevertheless, we frequently cannot actually perform this maximisation if the state or action space is very large. So we define $\phi$, a distribution on $S$, and parametrised sets of value functions $V_\Theta$ and policies $\Pi_\Theta$.

Parametric value function estimation.
$$V_\Theta = \{v_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|v_\theta - u\|_\phi, \tag{1.2}$$
where $\|f\|_\phi \triangleq \int_S |f| \, \mathrm{d}\phi$. In other words, we find the value function best matching the approximate value function $u$. If $u = v^*$, then we end up getting the best possible approximation with respect to the distribution $\phi$.

Parametric policy estimation.
$$\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|\pi_\theta - \pi_u\|_\phi, \tag{1.3}$$
where $\pi_u \in \arg\max_{\pi \in \Pi} \mathscr{L}_{\pi} u$.

Example 2. A simple case is when $\phi$ does not have full support on $S$, that is, it only takes positive values for some states $s \in S$.
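The $u$-greedy policy of Definition 1 can be sketched in a finite MDP, where the Bellman operator reduces to a maximisation over action values. The following is a minimal illustration with an invented toy MDP (the sizes, transition model, and rewards are all assumptions, not from the text):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, discount factor gamma.
# P[a, s, t] = transition probability to state t from state s under action a;
# r[s, a] = expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalise rows into distributions
r = rng.random((n_states, n_actions))

u = rng.random(n_states)                      # some approximate value function

# Action values under u: q(s, a) = r(s, a) + gamma * sum_t P(t | s, a) u(t),
# i.e. the policy Bellman operator evaluated per action.
q = r + gamma * np.einsum('ast,t->sa', P, u)

pi_u = q.argmax(axis=1)                       # a (deterministic) u-greedy policy
v_u = q.max(axis=1)                           # v_u = L u, as in (1.1)
```

A deterministic maximiser always exists in the finite case, which is why `argmax` over actions suffices here even though the definition allows stochastic policies.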
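The parametric value estimation step (1.2) can also be sketched concretely. Here we assume linear value functions $v_\theta = \Phi\theta$ and replace the $L_1$-style norm $\|\cdot\|_\phi$ of the text with a weighted squared norm, purely because that choice admits a closed-form solution; the feature matrix and the distribution $\phi$ are invented for illustration:

```python
import numpy as np

# Assumed setup: 5 states, 2 features; v_theta(s) = (Phi @ theta)(s).
n_states, n_features = 5, 2
rng = np.random.default_rng(1)
Phi = rng.random((n_states, n_features))     # feature matrix (assumption)
u = rng.random(n_states)                     # target approximate value function
phi = np.array([0.4, 0.3, 0.3, 0.0, 0.0])    # state distribution without full support

# Weighted least squares: theta minimises sum_s phi(s) * (v_theta(s) - u(s))^2.
W = np.diag(phi)
theta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ u)
v_theta = Phi @ theta

# States with phi(s) = 0 contribute nothing to the objective (cf. Example 2):
# the fit can be arbitrarily poor there without changing the norm.
err = phi @ np.abs(v_theta - u)
```

Note how the last two states, which $\phi$ assigns zero mass, simply drop out of the objective; this is exactly the situation Example 2 points at.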