Neural control of dopamine neurotransmission: implications for reinforcement learning

Mayank Aggarwal,1 Brian I. Hyland2 and Jeffery R. Wickens1

1 Neurobiology Research Unit, Okinawa Institute of Science and Technology, 1919-1, Tancha, Onna-Son, Kunigami, Okinawa 904-0412, Japan
2 Department of Physiology, Medical School, Dunedin, New Zealand

Keywords: basal ganglia, GABA, striatum, temporal difference

Abstract

In the past few decades there has been a remarkable convergence of machine learning with the neurobiological understanding of reinforcement learning mechanisms, exemplified by temporal difference (TD) learning models. The anatomy of the basal ganglia provides a number of potential substrates for instantiation of the TD mechanism. In contrast to the traditional concept of direct and indirect pathway outputs from the striatum, we emphasize that projection neurons of the striatum are branched, and that individual striatofugal neurons innervate both the globus pallidus externa and the globus pallidus interna/substantia nigra (GPi/SNr). This suggests that the GPi/SNr has the necessary inputs to operate as the source of a TD signal. We also discuss the mechanism for the timing processes necessary for learning in the TD framework. The TD framework has been particularly successful in analysing electrophysiological recordings from dopamine (DA) neurons during learning, in terms of reward prediction error. However, present understanding of the neural control of DA release is limited, and hence the neural mechanisms involved are incompletely understood. Inhibition is conspicuously present among the inputs to the DA neurons, with inhibitory synapses accounting for the majority of synapses on DA neurons. Furthermore, synchronous firing of the DA neuron population requires disinhibition and excitation to occur together in a coordinated manner.
We conclude that the inhibitory circuits impinging directly or indirectly on the DA neurons play a central role in the control of DA neuron activity, and that further investigation of these circuits may provide important insight into the biological mechanisms of reinforcement learning.

Introduction

Understanding of the neural mechanisms of reinforcement learning took a leap forward in the late 1990s, owing to the convergence of results from machine learning, recordings from dopamine (DA) neurons, and studies of DA-dependent synaptic plasticity in the corticostriatal pathway. This convergence led to the concept of the DA signal as a reward prediction error, which could be calculated by the temporal difference (TD) learning algorithm. In the decades that followed, TD learning proved to be a powerful concept for interpreting DA cell activity, and it has been applied to the analysis of neurophysiological recordings and imaging data with remarkable success. Although it was not originally introduced as a biological model, the TD model is predictive and can be instantiated in a biologically plausible architecture. This has led to a common view that to understand TD learning is to understand DA cell firing. Here we consider the validity of the TD model as a biological model. After outlining the assumptions of the TD framework and its putative biological substrates, we examine the biological reality of such a proposal. Although it is possible to fit some parts of the model, some important assumptions do not fit with current biological knowledge. We consider the implications of this biological reality for the future development of models to explain DA cell activity.

The traditional TD learning framework

We define the traditional TD learning framework as exemplified by the description in Montague et al. (1996). Figure 1 shows this framework, adapted from Pan et al. (2005); we will use the terminology of this figure.
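To make the reward prediction error concrete, the core TD update can be sketched in a few lines of Python. This is a minimal illustration, not the authors' model: the state names ("cue", "delay", "reward_state"), the discount factor, and the learning rate are all arbitrary assumptions chosen for a toy trial structure in which a cue is followed, after a delay, by reward.

```python
# Minimal sketch of the TD(0) prediction-error update.
# All names and parameter values are illustrative assumptions.
gamma = 0.9   # discount factor
alpha = 0.1   # learning rate

# Value estimates V(s) for a toy chain of states ending in reward.
V = {"cue": 0.0, "delay": 0.0, "reward_state": 0.0}

def td_update(V, s, r, s_next):
    """Compute the TD error delta = r + gamma*V(s') - V(s), then move
    V(s) a fraction alpha toward the better estimate."""
    v_next = V[s_next] if s_next is not None else 0.0  # terminal state has value 0
    delta = r + gamma * v_next - V[s]
    V[s] += alpha * delta
    return delta

# Repeated trials: cue -> delay -> reward_state (reward delivered at the end).
for episode in range(100):
    td_update(V, "cue", 0.0, "delay")
    td_update(V, "delay", 0.0, "reward_state")
    td_update(V, "reward_state", 1.0, None)
```

Over repeated trials the learned value propagates backward from the reward to the earliest predictive state, which parallels the well-known shift of DA responses from the reward itself to the predictive stimulus.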
All TD learning models have certain common features that can be mapped onto a network of nodes and connections. Here, a node is a processing point at which connections converge and/or diverge. A connection transmits information from one node to another and has states associated with it, for example an eligibility state. TD learning is not intended as a one-to-one biological model. However, as applied to the DA system, the TD learning framework makes a number of assumptions about the architecture and operations of the neural system that performs TD learning. In the simplest possible instantiation, a node might be equated with a neuron. However, in general, a node is more like a nucleus or network of many neurons that performs an operation. Similarly, a

Correspondence: Dr J. R. Wickens, as above. E-mail: wickens@oist.jp
Received 17 November 2011, revised 22 January 2012, accepted 30 January 2012
European Journal of Neuroscience, Vol. 35, pp. 1115–1123, 2012. doi: 10.1111/j.1460-9568.2012.08055.x
© 2012 The Authors. European Journal of Neuroscience © 2012 Federation of European Neuroscience Societies and Blackwell Publishing Ltd
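The eligibility state attached to a connection, mentioned in the framework description above, can be sketched as a decaying trace that marks recently active connections so that a later TD error can be credited back to them. As before, this is only an illustrative toy (TD(lambda)-style accumulating traces on a three-state trial); the state names and all parameter values are assumptions, not quantities from the article.

```python
# Sketch of eligibility traces on connections (TD(lambda)-style update).
# All names and parameter values are illustrative assumptions.
gamma, lam, alpha = 0.9, 0.8, 0.1

states = ["cue", "delay", "reward_state"]
V = {s: 0.0 for s in states}   # value estimate per state
e = {s: 0.0 for s in states}   # eligibility trace per state/connection

def step(s, r, s_next):
    """Mark s as eligible, then apply the TD error to every recently
    active state in proportion to its decaying trace."""
    v_next = V[s_next] if s_next is not None else 0.0
    delta = r + gamma * v_next - V[s]
    e[s] += 1.0                        # this connection just participated
    for k in states:
        V[k] += alpha * delta * e[k]   # credit assigned via the trace
        e[k] *= gamma * lam            # traces decay between time steps

for episode in range(100):
    for k in states:
        e[k] = 0.0                     # reset traces at the start of each trial
    step("cue", 0.0, "delay")
    step("delay", 0.0, "reward_state")
    step("reward_state", 1.0, None)
```

The design point the sketch illustrates is that the trace solves the temporal credit-assignment problem: the prediction error computed at reward time reaches the cue-related connection because that connection is still partially eligible, rather than requiring the error signal itself to be back-dated.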