Value Function Approximation
Michail G. Lagoudakis
Technical University of Crete
Synonyms
Approximate Dynamic Programming, Neuro-dynamic
Programming, Cost-to-go Function Approximation
Definition
The goal in sequential decision making under uncertainty is to find good or optimal policies for selecting
actions in stochastic environments in order to achieve a
long-term goal; such problems are typically modeled as
⊲Markov Decision Processes (MDPs). A key concept
in MDPs is the value function, a real-valued function
that summarizes the long-term goodness of a decision
into a single number and allows the formulation of opti-
mal decision making as an optimization problem. Exact
representation of value functions in large real-world
problems is infeasible, therefore a large body of research
has been devoted to value function approximation meth-
ods, which sacrifice some representation accuracy for
the sake of scalability. These methods have delivered effective approaches to deriving good policies in hard
decision problems and laid the foundation for efficient
reinforcement learning algorithms, which learn good
policies in unknown stochastic environments through
interaction.
Motivation and Background
Markov Decision Processes
A Markov Decision Process (MDP) is a six-tuple (S, A, P, R, γ, D), where S is the state space of the process, A is a finite set of actions, P is a Markovian transition model (P(s′ | s, a) denotes the probability of a transition to state s′ when taking action a in state s), R is a reward function (R(s, a) is the reward for taking action a in state s), γ ∈ (0, 1] is the discount factor for future rewards (a reward received after t steps is weighted by γ^t), and D is the initial state distribution (Puterman, 1994). MDPs are discrete-time processes.
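The six-tuple can be made concrete with a small example. The following is a minimal sketch of a toy two-state MDP as plain Python data structures; the field names and the particular numbers are illustrative assumptions, not part of any standard library.

```python
# A toy MDP (S, A, P, R, gamma, D) as plain Python data; all names
# and numbers here are illustrative.
mdp = {
    "states": ["s0", "s1"],                     # S
    "actions": ["stay", "move"],                # A
    # P[(s, a)] maps each next state s' to the probability P(s'|s, a)
    "P": {
        ("s0", "stay"): {"s0": 1.0},
        ("s0", "move"): {"s0": 0.2, "s1": 0.8},
        ("s1", "stay"): {"s1": 1.0},
        ("s1", "move"): {"s0": 0.9, "s1": 0.1},
    },
    # R[(s, a)] is the reward R(s, a) for taking action a in state s
    "R": {
        ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
        ("s1", "stay"): 2.0, ("s1", "move"): 0.0,
    },
    "gamma": 0.9,                               # discount factor in (0, 1]
    "D": {"s0": 1.0},                           # initial state distribution
}
```

Note that each row of the transition model is a probability distribution over next states, so its entries must sum to one.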
The process begins at time t = 0 in some state s_0 ∈ S drawn from D. At each time step t, the decision maker observes the current state of the process s_t ∈ S and chooses an action a_t ∈ A. The next state of the process s_{t+1} is drawn stochastically according to the transition model P(s_{t+1} | s_t, a_t), and the reward r_t at that time step is determined by the reward function R(s_t, a_t). The horizon h is the temporal extent of each run of the process and is typically infinite. A complete run of the process over its horizon is called an episode and consists of a long sequence of states, actions, and rewards:

s_0 →(a_0, r_0) s_1 →(a_1, r_1) s_2 ... s_{h−1} →(a_{h−1}, r_{h−1}) s_h.
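The generation of such an episode can be sketched in code. The toy two-state MDP below and the `sample_episode` helper are illustrative assumptions (not from any particular library); the loop follows the dynamics just described: observe s_t, choose a_t, receive r_t = R(s_t, a_t), and draw s_{t+1} from P(· | s_t, a_t).

```python
import random

# Illustrative transition model P[(s, a)][s'] = P(s'|s, a) and rewards R(s, a)
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}

def sample_episode(s0, horizon, policy):
    """Sample [(s_0, a_0, r_0), ..., (s_{h-1}, a_{h-1}, r_{h-1})]."""
    episode, s = [], s0
    for t in range(horizon):
        a = policy(s)                      # decision maker chooses a_t
        r = R[(s, a)]                      # r_t = R(s_t, a_t)
        # draw s_{t+1} ~ P(.|s_t, a_t)
        next_states, probs = zip(*P[(s, a)].items())
        s_next = random.choices(next_states, weights=probs)[0]
        episode.append((s, a, r))
        s = s_next
    return episode

random.seed(0)
ep = sample_episode("s0", horizon=5, policy=lambda s: "move")
```

Here the (hypothetical) policy always chooses "move"; a real policy would map each state to an action, possibly stochastically.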
The quantity of interest is the expected total discounted reward from any state s:

E(r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + ⋯ + γ^h r_h | s_0 = s) = E( ∑_{t=0}^{h} γ^t r_t | s_0 = s ),
where the expectation is taken with respect to all
possible trajectories of the process in the state space
under the decisions made and the transition model,
assuming that the process is initialized in state s. The goal of the decision maker is to make decisions so that the expected total discounted reward, when s is drawn from D, is optimized. (The optimization objective could be maximization or minimization depending on the problem. Here, we adopt a reward maximization viewpoint, but there are analogous definitions for cost minimization. There are also other popular optimality measures, such as maximization/minimization of the average reward/cost per step.)
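For a finite reward sequence, the discounted sum inside the expectation is straightforward to compute; the helper below is a minimal sketch (its name is our own), and the expectation itself would be estimated in practice by averaging such returns over many sampled episodes.

```python
# Total discounted reward: sum over t of gamma^t * r_t for a finite
# reward sequence r_0, r_1, ..., r_h. Function name is illustrative.
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

# For rewards (1, 1, 1) and gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], 0.5)
```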
Claude Sammut & Geoffrey I. Webb (eds.), Encyclopedia of Machine Learning,
© Springer Science+Business Media LLC