V

Value Function Approximation

Michail G. Lagoudakis
Technical University of Crete

Synonyms

Approximate Dynamic Programming; Neuro-Dynamic Programming; Cost-to-Go Function Approximation

Definition

The goal in sequential decision making under uncertainty is to find good or optimal policies for selecting actions in stochastic environments in order to achieve a long-term goal; such problems are typically modeled as Markov Decision Processes (MDPs). A key concept in MDPs is the value function, a real-valued function that summarizes the long-term goodness of a decision into a single number and allows the formulation of optimal decision making as an optimization problem. Exact representation of value functions in large real-world problems is infeasible, so a large body of research has been devoted to value function approximation methods, which sacrifice some representation accuracy for the sake of scalability. These methods have proved effective for deriving good policies in hard decision problems and have laid the foundation for efficient reinforcement learning algorithms, which learn good policies in unknown stochastic environments through interaction.

Motivation and Background

Markov Decision Processes

A Markov Decision Process (MDP) is a six-tuple (S, A, P, R, γ, D), where S is the state space of the process, A is a finite set of actions, P is a Markovian transition model (P(s′ | s, a) denotes the probability of a transition to state s′ when taking action a in state s), R is a reward function (R(s, a) is the reward for taking action a in state s), γ ∈ (0, 1] is the discount factor for future rewards (a reward received after t steps is weighted by γ^t), and D is the initial state distribution (Puterman, 1994). MDPs are discrete-time processes. The process begins at time t = 0 in some state s_0 ∈ S drawn from D. At each time step t, the decision maker observes the current state of the process s_t ∈ S and chooses an action a_t ∈ A.
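To make the agent–environment interaction concrete, here is a minimal Python sketch of an episode in a hypothetical two-state MDP. All states, actions, probabilities, and rewards below are invented for illustration (they do not come from the text); the loop simply draws s_0 from D, samples each next state from P(· | s, a), and accumulates each reward weighted by γ^t.

```python
import random

# Hypothetical two-state, two-action MDP (illustrative values only).
S = [0, 1]
A = [0, 1]
gamma = 0.9

# P[s][a] gives the transition probabilities over next states.
P = {0: {0: [0.8, 0.2], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [0.3, 0.7]}}

# R[s][a] is the reward for taking action a in state s.
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 0.0, 1: 2.0}}

D = [1.0, 0.0]  # initial state distribution: always start in state 0

def run_episode(policy, horizon, seed=None):
    """Simulate one episode of length `horizon` and return
    its total discounted reward, sum over t of gamma^t * r_t."""
    rng = random.Random(seed)
    s = rng.choices(S, weights=D)[0]           # s_0 ~ D
    ret = 0.0
    for t in range(horizon):
        a = policy(s)                          # decision maker picks a_t
        ret += (gamma ** t) * R[s][a]          # reward weighted by gamma^t
        s = rng.choices(S, weights=P[s][a])[0] # s_{t+1} ~ P(. | s_t, a_t)
    return ret

# Example: a fixed policy that always takes action 1.
total = run_episode(lambda s: 1, horizon=10, seed=0)
```

In practice the horizon is typically infinite, and the discounted sum is estimated by truncating the episode once γ^t becomes negligible.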
The next state of the process s_{t+1} is drawn stochastically according to the transition model P(s_{t+1} | s_t, a_t), and the reward r_t at that time step is determined by the reward function R(s_t, a_t). The horizon h is the temporal extent of each run of the process and is typically infinite. A complete run of the process over its horizon is called an episode and consists of a long sequence of states, actions, and rewards:

s_0 --a_0--> r_0, s_1 --a_1--> r_1, s_2, ..., s_{h-1} --a_{h-1}--> r_{h-1}, s_h.

The quantity of interest is the expected total discounted reward from any state s:

E(r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + ⋯ + γ^h r_h | s_0 = s) = E(Σ_{t=0}^{h} γ^t r_t | s_0 = s),

where the expectation is taken with respect to all possible trajectories of the process in the state space under the decisions made and the transition model, assuming that the process is initialized in state s. The goal of the decision maker is to make decisions so that the expected total discounted reward, when s is drawn from D, is optimized. (The optimization objective could be maximization or minimization depending on the problem. Here, we adopt a reward maximization viewpoint, but there are analogous definitions for cost minimization. There are also other popular optimality measures, such as maximization/minimization of the average reward/cost per step.)

Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning, © Springer Science+Business Media LLC