Reinforcement Learning for True Adaptive Traffic Signal Control

Baher Abdulhai¹; Rob Pringle²; and Grigoris J. Karakoulas³

Abstract: The ability to exert real-time, adaptive control of transportation processes is the core of many intelligent transportation systems decision support tools. Reinforcement learning, an artificial intelligence approach undergoing development in the machine-learning community, offers key advantages in this regard. The ability of a control agent to learn relationships between control actions and their effect on the environment while pursuing a goal is a distinct improvement over prespecified models of the environment. Prespecified models are a prerequisite of conventional control methods, and their accuracy limits the performance of control agents. This paper contains an introduction to Q-learning, a simple yet powerful reinforcement learning algorithm, and presents a case study involving application to traffic signal control. Encouraging results of the application to an isolated traffic signal, particularly under variable traffic conditions, are presented. A broader research effort is outlined, including extension to linear and networked signal systems and integration with dynamic route guidance. The research objective involves optimal control of heavily congested traffic across a two-dimensional road network, a challenging task for conventional traffic signal control methodologies.

DOI: 10.1061/(ASCE)0733-947X(2003)129:3(278)

CE Database subject headings: Traffic signal controllers; Intelligent transportation systems; Traffic control; Traffic management; Adaptive systems.
Introduction

The ability to exert real-time, adaptive control over a transportation process is potentially useful for a variety of intelligent transportation systems services, including control of a system of traffic signals, control of the dispatching of paratransit vehicles, and control of the changeable message displays or other cues in a dynamic route guidance system, to name a few. In each case, the controlling actions should respond to actual environmental conditions: vehicular demand in the case of a signal system, the demand for multiple paratransit trip origins and destinations, or the road network topology and traffic conditions in the case of dynamic route guidance. Even more valuable is the ability to control in accordance with an optimal strategy defined in terms of one or more performance objectives. For example, one might wish to have a signal control strategy that minimizes delay, a paratransit dispatching system that minimizes wait time and vehicle kilometers traveled, or a dynamic route guidance system that minimizes travel time.

A key limitation of conventional control systems is a requirement for one or more prespecified models of the environment. The purpose of these might be to convert sensory inputs into a useful picture of current or impending conditions, or to provide an assessment of the probable impacts of alternative control actions in a given situation. Such models require domain expertise to construct. Furthermore, they must often be sufficiently general to cover a variety of conditions, as it is usually impractical to provide separate models to address each potential situation. For example, some state-of-the-art traffic signal control systems rely on a platoon-dispersion model to predict the arrival pattern of vehicles at a downstream signal based on departures from an upstream signal.
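To make the platoon-dispersion idea concrete, the following is a minimal sketch of a Robertson-style dispersion recursion of the kind used in TRANSYT-type systems. It is illustrative only, not the model used by any particular system discussed here; the parameter values and the function name are assumptions.

```python
# Illustrative Robertson-style platoon dispersion (not a specific system's model).
# Each downstream arrival-flow slot blends the upstream departure flow with the
# previous downstream slot, smoothing the platoon as it travels:
#   q_d[t + T] = F * q_u[t] + (1 - F) * q_d[t + T - 1],  F = 1 / (1 + alpha * T)
def disperse_platoon(upstream_flows, travel_steps=4, alpha=0.35):
    """Predict a downstream arrival pattern from upstream departures.

    upstream_flows: departure flow per time step at the upstream signal.
    travel_steps:   nominal travel time T between signals, in time steps.
    alpha:          dispersion parameter (assumed value, link-dependent in practice).
    """
    F = 1.0 / (1.0 + alpha * travel_steps)
    downstream = [0.0] * (len(upstream_flows) + travel_steps)
    for t, q in enumerate(upstream_flows):
        downstream[t + travel_steps] = F * q + (1 - F) * downstream[t + travel_steps - 1]
    return downstream

# A single departing platoon arrives downstream later, spread over several steps.
arrivals = disperse_platoon([10.0, 0.0, 0.0, 0.0], travel_steps=2, alpha=0.5)
```

A single calibrated alpha for all links is exactly the kind of generalization the next paragraph criticizes: driveways and side streets between the signals make the true arrival pattern link-specific.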
A generalized model designed to represent all road links cannot possibly reflect the impacts of the different combinations of side streets and driveways generating and absorbing traffic between the upstream and downstream signals.

What if a controlling agent could directly learn the various relationships inherent in its world from its experience with different situations in that world? Not only would the need for model prespecification be obviated, or at least minimized, but such an agent could effectively tailor its control actions to specific situations based on its past experience with the same or similar situations. The machine-learning research community, related to the artificial intelligence community, provides us with a variety of methods that might be adapted to transportation control problems. One of these, particularly useful due to its conceptual simplicity, yet impressive in its potential, is reinforcement learning (see Sutton and Barto 1998 or Kaelbling et al. 1996 for comprehensive overviews, or Bertsekas and Tsitsiklis 1996 for a more rigorous treatment).

This paper provides a brief introduction to the concept of reinforcement learning. As a case study, reinforcement learning is applied to the case of an isolated traffic signal with encouraging results. This is the first stage in a research program to develop a signal system control methodology based on reinforcement learning.

¹Assistant Professor and Director, Intelligent Transportation Systems Centre, Dept. of Civil Engineering, Univ. of Toronto, Toronto, ON, Canada M5S 1A4. E-mail: baher@ecf.utoronto.ca
²PhD Candidate, Intelligent Transportation Systems Centre, Dept. of Civil Engineering, Univ. of Toronto, Toronto, ON, Canada M5S 1A4. E-mail: rob.pringle@utoronto.ca
³Dept. of Computer Science, Univ. of Toronto, Pratt Building LP283E, 6 King's College, Toronto, ON, Canada M5S 1A4. E-mail: grigoris@cs.toronto.edu

Note. Discussion open until October 1, 2003.
Separate discussions must be submitted for individual papers. To extend the closing date by one month, a written request must be filed with the ASCE Managing Editor. The manuscript for this paper was submitted for review and possible publication on October 30, 2001; approved on May 21, 2002. This paper is part of the Journal of Transportation Engineering, Vol. 129, No. 3, May 1, 2003. ©ASCE, ISSN 0733-947X/2003/3-278–285/$18.00.

278 / JOURNAL OF TRANSPORTATION ENGINEERING © ASCE / MAY/JUNE 2003
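Since the paper's case study is built on Q-learning, the core of the algorithm can be sketched in a few lines. This is a generic tabular Q-learning update with an epsilon-greedy policy, not the authors' implementation; the two-action signal interpretation, the state encoding, and the reward (negative queue length) are hypothetical choices for illustration.

```python
# Generic tabular Q-learning sketch (illustrative; state/action/reward
# definitions are assumptions, not the paper's design).
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate
GAMMA = 0.9    # discount factor
EPSILON = 0.1  # exploration probability

ACTIONS = [0, 1]  # e.g., 0 = extend current signal phase, 1 = switch phase

Q = defaultdict(float)  # Q[(state, action)] -> estimated long-run return

def choose_action(state):
    """Epsilon-greedy selection: mostly exploit, occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One Q-learning backup:
    Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One toy interaction step: the reward is the negative of the queued vehicles
# observed during the step, so reducing delay raises the estimated value.
update(state="short_queue", action=0, reward=-2.0, next_state="short_queue")
```

No model of the environment appears anywhere above: the agent needs only the observed state, reward, and next state, which is the property the paper argues distinguishes this approach from model-based control.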