Book Reviews
Reinforcement Learning
R. S. Sutton and A. G. Barto
Cambridge, MA: MIT Press, 1998.
Hardbound, $40.00. ISBN 0-262-19398-1
Reviewed by C. R. Gallistel
Reinforcement learning, as understood by Sutton and
Barto, is a fusion of the trial-and-error “law-of-effect”
tradition in psychology, optimal control theory in
engineering, the secondary reinforcement tradition in
learning, and the use of decaying stimulus traces in, for
example, Hull’s (1952) concept of a goal gradient and,
more recently, Wagner’s (1981) model of conditioning.
This fusion has given researchers in artificial intelligence
a number of ideas for computer algorithms that learn a
policy that maximizes the agent’s long-term return
(amount of reward) from the performance of a task.
Although many of the ideas behind reinforcement
learning originated in psychological theorizing, in recent
years these ideas have been most extensively developed
within the artificial intelligence community, particularly by
the authors of this important summary, their students,
and colleagues. The book is intended for use in a
one-semester course for students interested in machine
learning and artificial intelligence. It would probably not be
suitable for a course intended for psychology and
neuroscience students, because it does not present models
of experimentally established behavioral or neuroscientific
phenomena, and the problems given to illustrate
how reinforcement learning algorithms may be applied
are not necessarily problems that animals (even human
animals) are notably good at solving (e.g., efficient
scheduling problems). However, reinforcement learning,
incentive, and utility remain central concepts in contemporary
work on the neurobiological basis of learned, goal-
directed behavior (Schultz, Dayan, & Montague, 1997;
Shizgal, 1997), and this book is the place to look for the
latest ideas on how these concepts may be developed
into effective models for the direction of action.
The preface says that the only mathematical
background assumed is familiarity with elementary concepts
of probability, such as expectations of random variables.
Most students will feel that rather more than that is in
fact assumed. Nonetheless, the material is presented in
an intuitively understandable form, emphasizing the
basic ideas and giving helpful illustrations of their
application, rather than elaborating proofs. It can be read with
profit by motivated neuroscience and psychology
students interested in a more rigorous development of
these psychologically important ideas. The bibliographical
and historical sections at the end of each chapter are
useful for the perspective they give. The many suggested
exercises are challenging, open-ended, and thought-provoking.
Basic Concepts in Reinforcement Learning
The reward function is the objective feedback from the
environment. Rewards are scalar values (often simply
integers) associated with some states or state-action pairs.
This association (the reward function) defines the goal of
the agent in a given situation. For example, a reward of 1
might be associated with the state of having moved all
one’s pieces off the board in a backgammon game, while
a reward of 0 is associated with all other states of the
game (all the states leading up to the winning state). The
agent’s sole objective is to maximize net long-term
reward (e.g., in a backgammon-playing agent, the number of
games won). The reward function defines what is
objectively good and bad for the agent, and it is unalterable
by the agent. In traditional psychological terms, the
reward function specifies the states associated with
primary reinforcement.
The concepts of state and action are very general. An
action is any decision an agent might need to learn how
to make, and a state is any factor that the agent might
take into consideration in making that decision. The state
on which an action is predicated may include a model
of the environment, representing, for example, past
states of the actor’s environment; such a representation
counts as part of the agent’s state at the time
it decides on an action. Thus, reinforcement learning,
unlike the more extreme (or pure) forms of neural net
learning, is not necessarily asymbolic. It seems to be, at
least in principle, neutral on the issue of what knowledge
is and how it determines action.
The agent’s policy (the policy function) is the map-
ping from possible states to possible actions. The map-
ping may be a simple look-up table, what is sometimes
called a stored policy. A psychologist would call it a set
of stimulus-response associations. Alternatively, the agent
may rely on a computed policy. A computed policy may
involve a search through an ever-changing tree of values,
© 1999 Massachusetts Institute of Technology Journal of Cognitive Neuroscience 11:1, pp. 126–134