Reinforcement learning and human behavior

Hanan Shteingart 1 and Yonatan Loewenstein 1,2,3,4

The dominant computational approach to modeling operant learning and its underlying neural activity is model-free reinforcement learning (RL). However, there is accumulating behavioral and neuronal evidence that human (and animal) operant learning is far more multifaceted. Theoretical advances in RL, such as hierarchical and model-based RL, extend the explanatory power of RL to account for some of these findings. Nevertheless, other aspects of human behavior remain inexplicable even in the simplest tasks. Here we review developments and remaining challenges in relating RL models to human operant learning. In particular, we emphasize that learning a model of the world is an essential step before, or in parallel to, learning the policy in RL, and we discuss alternative models that directly learn a policy, without an explicit world model, in terms of state-action pairs.

Addresses
1 Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem 91904, Israel
2 Department of Neurobiology, The Alexander Silberman Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel
3 Department of Cognitive Science, The Hebrew University, Jerusalem 91904, Israel
4 Center for the Study of Rationality, The Hebrew University, Jerusalem 91904, Israel

Corresponding author: Loewenstein, Yonatan (yonatan@huji.ac.il)

Current Opinion in Neurobiology 2014, 25:93-98

This review comes from a themed issue on Theoretical and computational neuroscience
Edited by Adrienne Fairhall and Haim Sompolinsky

0959-4388/$ - see front matter, © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.conb.2013.12.004

Model-free RL

The computational problem in many operant learning tasks can be formulated in the framework of Markov Decision Processes (MDPs) [1]. In an MDP, the world can be in one of several states, which determine the consequences of the agent's actions with respect to future rewards and future world states. A policy defines the agent's behavior in a given situation; in MDPs, a policy is a mapping from the states of the environment to the actions to be taken in those states [1]. Finding the optimal policy is difficult because actions may have both immediate and long-term consequences. However, this problem can be simplified by estimating values, the expected cumulative (discounted) rewards associated with states and actions, and by using these values to choose among the actions (for a detailed characterization of the mapping from values to actions in humans, see [2]).

Model-free RL, as its name suggests, is a family of RL algorithms devised to learn the values of the states without learning the full specification of the MDP. In a class of model-free algorithms known as temporal-difference learning, the learning of the values is based on the reward-prediction error (RPE), the discrepancy between the expected reward before and after an action is taken (taking into account also the reward obtained as a result).
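To make the update rule concrete, the following is a minimal, illustrative sketch of temporal-difference learning of state-action values. It is not taken from the article; the learning rate, discount factor, epsilon-greedy policy and environment interface are assumptions chosen for the example. The values are adjusted by the RPE alone, using only the experienced transition and reward, without any representation of the MDP's transition probabilities.

```python
# Minimal sketch of model-free temporal-difference (Q-learning-style) value
# learning driven by a reward-prediction error (RPE). Illustrative only:
# the learning rate, discount factor and epsilon-greedy policy are
# assumptions, not the specific model discussed in the article.
import random
from collections import defaultdict

alpha = 0.1    # learning rate: how strongly each RPE updates the value
gamma = 0.9    # discount factor applied to future rewards
epsilon = 0.1  # probability of taking a random, exploratory action

Q = defaultdict(float)  # state-action values, learned without a world model

def choose_action(state, actions):
    """Epsilon-greedy policy: map learned values to a chosen action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def td_update(state, action, reward, next_state, actions):
    """Update Q(state, action) by the reward-prediction error."""
    expected_before = Q[(state, action)]              # prediction before the outcome
    expected_after = reward + gamma * max(Q[(next_state, a)] for a in actions)
    rpe = expected_after - expected_before            # reward-prediction error
    Q[(state, action)] += alpha * rpe
```

Because the update uses only the sampled reward and next state, and never the transition probabilities themselves, the algorithm is "model-free" in the sense used above.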
The hypothesis that the brain utilizes model-free RL for operant learning holds considerable sway in the field of neuroeconomics. This hypothesis is supported by experiments demonstrating that in primates, the phasic activity of mid-brain dopaminergic neurons is correlated with the RPE [3,4]. In mice, this correlation was also shown to be causal: optogenetic activation of dopaminergic neurons is sufficient to drive operant learning, supporting the hypothesis that dopaminergic neurons encode the RPE, which is used for operant learning [5]. Other putative brain regions for this computation are the striatum, whose activity is correlated with the values of states and/or actions [6,7], and the nucleus accumbens and pallidum, which are involved in the selection of actions [8].

In addition to its neural correlates, model-free RL has been used to account for the trial-by-trial dynamics of learning (e.g., [2]) and for several robust aggregate features of human behavior, such as risk aversion [9], recency [10] and primacy [2]. Moreover, model-free RL has proven useful in the field of computational psychiatry as a way of diagnosing and characterizing different pathologies [11-13,14].

However, there is also evidence that the correspondence between dopaminergic neurons and the RPE is more complex and diverse than was previously thought [15]. First, dopaminergic neurons increase their firing rate in response to both surprisingly positive and surprisingly negative reinforcements [16,17]. Second, dopaminergic activity is correlated with other variables of the task, such as uncertainty [18]. Third, the RPE is not exclusively represented by dopamine: additional neuromodulators, in particular serotonin, are also correlated with the RPE [19]. Finally, some findings suggest that reinforcement and punishment signals are not local but rather ubiquitous in the human brain [20]. These results challenge the dominance of the anatomically modular model-free RL as a model of operant learning.