Approximate dynamic programming
Optimal decisions, Part 9
Christos Dimitrakakis
November 14, 2012

1 Introduction

In this chapter, we consider approximate dynamic programming. This includes all methods with approximations in the maximisation step, methods where the value function used is approximate, and methods where the policy used is some approximation to the optimal policy.

We first consider the case where we have an approximate value function. Let $u \in V$ be an approximate optimal value function obtained via some arbitrary method. Then we can define the greedy policy with respect to it as follows:

Definition 1 ($u$-greedy policy and value function).
$$\pi_u \in \arg\max_{\pi} \mathscr{L}_{\pi} u, \qquad v_u = \mathscr{L} u, \tag{1.1}$$
where $\pi : S \to D(A)$ maps from states to action distributions.

Although previously policies did not need to be stochastic, here we explicitly consider stochastic policies to facilitate the approximations. Nevertheless, we frequently cannot actually perform this maximisation if the state or action space is very large. So we define $\phi$, a distribution on $S$, and parametrised sets of value functions $V_\Theta$ and policies $\Pi_\Theta$.

Parametric value function estimation.
$$V_\Theta = \{v_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|v_\theta - u\|_\phi, \tag{1.2}$$
where $\|f\|_\phi \triangleq \int_S |f| \, \mathrm{d}\phi$. In other words, we find the value function best matching the approximate value function $u$. If $u = v^*$, then we end up getting the best possible approximation with respect to the distribution $\phi$.

Parametric policy estimation.
$$\Pi_\Theta = \{\pi_\theta \mid \theta \in \Theta\}, \qquad \theta^* \in \arg\min_{\theta \in \Theta} \|\pi_\theta - \pi_u\|_\phi, \tag{1.3}$$
where $\pi_u \in \arg\max_{\pi \in \Pi} \mathscr{L}_{\pi} u$.

Example 2. A simple case is when $\phi$ does not have full support on $S$, that is, it only takes positive values for some states $s \in S$.
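The $u$-greedy policy of Definition 1 can be sketched in a finite MDP, where the Bellman operator reduces to a maximisation over action values. The following is a minimal illustration with an invented toy MDP (the sizes, transition model, and rewards are all assumptions, not from the text):

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions, discount factor gamma.
# P[a, s, t] = transition probability to state t from state s under action a;
# r[s, a] = expected immediate reward.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)            # normalise rows into distributions
r = rng.random((n_states, n_actions))

u = rng.random(n_states)                      # some approximate value function

# Action values under u: q(s, a) = r(s, a) + gamma * sum_t P(t | s, a) u(t),
# i.e. the policy Bellman operator evaluated per action.
q = r + gamma * np.einsum('ast,t->sa', P, u)

pi_u = q.argmax(axis=1)                       # a (deterministic) u-greedy policy
v_u = q.max(axis=1)                           # v_u = L u, as in (1.1)
```

A deterministic maximiser always exists in the finite case, which is why `argmax` over actions suffices here even though the definition allows stochastic policies.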
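The parametric value estimation step (1.2) can also be sketched concretely. Here we assume linear value functions $v_\theta = \Phi\theta$ and replace the $L_1$-style norm $\|\cdot\|_\phi$ of the text with a weighted squared norm, purely because that choice admits a closed-form solution; the feature matrix and the distribution $\phi$ are invented for illustration:

```python
import numpy as np

# Assumed setup: 5 states, 2 features; v_theta(s) = (Phi @ theta)(s).
n_states, n_features = 5, 2
rng = np.random.default_rng(1)
Phi = rng.random((n_states, n_features))     # feature matrix (assumption)
u = rng.random(n_states)                     # target approximate value function
phi = np.array([0.4, 0.3, 0.3, 0.0, 0.0])    # state distribution without full support

# Weighted least squares: theta minimises sum_s phi(s) * (v_theta(s) - u(s))^2.
W = np.diag(phi)
theta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ u)
v_theta = Phi @ theta

# States with phi(s) = 0 contribute nothing to the objective (cf. Example 2):
# the fit can be arbitrarily poor there without changing the norm.
err = phi @ np.abs(v_theta - u)
```

Note how the last two states, which $\phi$ assigns zero mass, simply drop out of the objective; this is exactly the situation Example 2 points at.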