Book Reviews
Reinforcement Learning
R. S. Sutton and A. G. Barto
Cambridge, MA: MIT Press, 1998.
Hardbound, $40.00. ISBN 0-262-19398-1
Reviewed by C. R. Gallistel
Reinforcement learning, as understood by Sutton and
Barto, is a fusion of the trial-and-error “law-of-effect”
tradition in psychology, optimal control theory in
engineering, the secondary reinforcement tradition in
learning, and the use of decaying stimulus traces in, for
example, Hull’s (1952) concept of a goal gradient and,
more recently, Wagner’s (1981) model of conditioning.
This fusion has given researchers in artificial intelligence
a number of ideas for computer algorithms that learn a
policy that maximizes the agent’s long-term return
(amount of reward) from the performance of a task.
Although many of the ideas behind reinforcement
learning originated in psychological theorizing, in recent
years these ideas have been most extensively developed
within the artificial intelligence community, particularly by
the authors of this important summary, their students,
and colleagues. The book is intended for use in a
one-semester course for students interested in machine
learning and artificial intelligence. It would probably not be
suitable for a course intended for psychology and
neuroscience students, because it does not present models
of experimentally established behavioral or neuroscientific
phenomena, and the problems given to illustrate
how reinforcement learning algorithms may be applied
are not necessarily problems that animals (even human
animals) are notably good at solving (e.g., efficient
scheduling problems). However, reinforcement learning,
incentive, and utility remain central concepts in contemporary
work on the neurobiological basis of learned, goal-
directed behavior (Schultz, Dayan, & Montague, 1997;
Shizgal, 1997), and this book is the place to look for the
latest ideas on how these concepts may be developed
into effective models for the direction of action.
The preface says that the only mathematical
background assumed is familiarity with elementary concepts
of probability, such as expectations of random variables.
Most students will feel that rather more than that is in
fact assumed. Nonetheless, the material is presented in
an intuitively understandable form, emphasizing the
basic ideas and giving helpful illustrations of their
application, rather than elaborating proofs. It can be read with
profit by motivated neuroscience and psychology
students interested in a more rigorous development of
these psychologically important ideas. The bibliographical
and historical sections at the end of each chapter are
useful for the perspective they give. The many suggested
exercises are challenging, open-ended, and thought-provoking.
Basic Concepts in Reinforcement Learning
The reward function is the objective feedback from the
environment. Rewards are scalar values (often simply
integers) associated with some states or state-action pairs.
This association (the reward function) defines the goal of
the agent in a given situation. For example, a reward of 1
might be associated with the state of having moved all
one’s pieces off the board in a backgammon game, while
a reward of 0 is associated with all other states of the
game (all the states leading up to the winning state). The
agent’s sole objective is to maximize net long-term
reward (e.g., in a backgammon-playing agent, the number of
games won). The reward function defines what is
objectively good and bad for the agent, and it is unalterable
by the agent. In traditional psychological terms, the
reward function specifies the states associated with
primary reinforcement.
The concepts of state and action are very general. An
action is any decision an agent might need to learn how
to make, and a state is any factor that the agent might
take into consideration in making that decision. The state
on which an action is predicated may include a model
of the environment, representing, for example, past
states of the actor’s environment; such a representation
counts as part of the agent’s state at the time
it decides on an action. Thus, reinforcement learning,
unlike the more extreme (or pure) forms of neural net
learning, is not necessarily asymbolic. It seems to be, at
least in principle, neutral on the issue of what knowledge
is and how it determines action.
The agent’s policy (the policy function) is the map-
ping from possible states to possible actions. The map-
ping may be a simple look-up table, what is sometimes
called a stored policy. A psychologist would call it a set
of stimulus-response associations. Alternatively, the agent
may rely on a computed policy. A computed policy may
involve a search through an ever-changing tree of values,
© 1999 Massachusetts Institute of Technology Journal of Cognitive Neuroscience 11:1, pp. 126–134