Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007)

Opposition-Based Reinforcement Learning in the Management of Water Resources

M. Mahootchi, H. R. Tizhoosh, K. Ponnambalam
Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, N2L 3G1, Canada
mmahootc@engmail.uwaterloo.ca, tizhoosh@uwaterloo.ca, ponnu@uwaterloo.ca

Abstract— Opposition-Based Learning (OBL) is a recent scheme in machine intelligence. In this paper, an OBL version of Q-learning, which exploits opposite quantities to accelerate learning, is used for the management of single-reservoir operations. In this method, an agent takes an action, receives a reward, and updates its knowledge in terms of action-value functions. Furthermore, the transition function, which is the balance equation in the optimization model, determines the next state and updates the action-value function pertinent to the opposite action. Two types of opposite actions will be defined. It will be demonstrated that using OBL can significantly improve the efficiency of the operating policy within a limited number of iterations. It is also shown that this technique is more robust than Q-learning.

Index Terms— water reservoirs, Q-learning, opposite action, reinforcement learning.

I. INTRODUCTION

Finding efficient operating policies in multi-reservoir applications has been a challenging research area over the past decades. Many attempts using traditional methods, including linear and non-linear optimization techniques, have been made to overcome the curse of dimensionality in real-world applications. However, most of these efforts have involved various simplifications and approximations, which usually make the resulting operating policies inefficient in practice. Using optimization techniques along with simulation, such as Reinforcement Learning (RL) techniques, could be a suitable alternative for this purpose.
RL is a powerful and well-known technique in machine learning research that copes well with many optimization and simulation problems. It is also called Simulation-Based Dynamic Programming [1], in which a decision maker (agent) optimizes an objective function through interacting with a deterministic or stochastic environment. These interactions may yield instant rewards or punishments, which are accumulated during the training process and called action-value functions. These values are the basis on which the agent takes proper actions in different situations (states). As proven by Watkins [2], these values converge to their optimal values if each state-action pair is visited an infinite number of times (practically, multiple times); however, this may take too much time in real-world applications. Therefore, the question arises of how to achieve an optimal solution with fewer interactions. The Opposition-Based Learning (OBL) scheme, first introduced by Tizhoosh [3], could be a suitable answer to this question. Tizhoosh has shown that using this scheme in some soft computing methods, such as Genetic Algorithms (GA), Neural Networks (NN), and Reinforcement Learning (RL), can generally speed up the training process; however, this is completely problem dependent. He also used this scheme with RL to find a path to a fixed goal in discrete grid worlds of different sizes [3]. In that example, an agent takes an action in the current state and updates the respective action-value function in addition to the functions related to the opposite actions or states. The criterion for granting reward or punishment to the agent is the distance to a fixed goal inside the grid. Moreover, the environment in that case study is totally deterministic. In this paper, we investigate the effect of the Opposition-Based Learning (OBL) scheme using the Q-learning method for reservoir management.
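The mechanism described above — updating the value of the taken action and then also updating the value of its opposite by simulating the opposite outcome through the known transition function — can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the state/action sizes, the mirror-style `opposite` mapping, and the `transition`/`reward_fn` callbacks are all hypothetical placeholders.

```python
# Minimal sketch of an opposition-based Q-learning update (assumed form).
# States and actions are small integer sets; the "opposite" of an action
# is taken here as its mirror within the action range (one possible choice).

N_STATES, N_ACTIONS = 5, 4
ALPHA, GAMMA = 0.1, 0.9

# Tabular action-value function Q(s, a), initialized to zero.
q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def opposite(action):
    """Mirror the action within the action range (illustrative definition)."""
    return (N_ACTIONS - 1) - action

def update(state, action, reward, next_state):
    """Standard Q-learning update for one (s, a, r, s') sample."""
    best_next = max(q[next_state])
    q[state][action] += ALPHA * (reward + GAMMA * best_next - q[state][action])

def opposition_update(state, action, reward, next_state, transition, reward_fn):
    """Update the taken action, then also update its opposite action,
    using the known transition (balance) equation to compute the
    opposite action's next state and reward without taking it."""
    update(state, action, reward, next_state)
    a_opp = opposite(action)
    s_opp = transition(state, a_opp)        # next state under the opposite action
    r_opp = reward_fn(state, a_opp, s_opp)  # simulated reward for the opposite action
    update(state, a_opp, r_opp, s_opp)
```

With a deterministic transition available (as with a reservoir balance equation), each real interaction thus produces two value updates instead of one, which is the source of the claimed speed-up.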
To show that this scheme is efficient, it is applied to a single-reservoir problem that is completely stochastic in terms of inflow to the reservoir. For this problem, we can find the optimal or near-optimal policies and performances by regular Q-learning and simulation in a reasonable time. The corresponding results will be extended to multi-reservoir applications in future research. The paper is organized as follows. In the next section, a simple model of a single reservoir will be explained. Section 3 will provide a general review of Stochastic Dynamic Programming (SDP). In Section 4, Q-learning will be briefly reviewed. Some basic concepts of OBL and a version of the opposition-based algorithm using Q-learning for reservoir management will be described in Sections 5 and 6. Finally, in Sections 7 and 8, some experimental results and conclusions will be provided.
1-4244-0706-0/07/$20.00 ©2007 IEEE 217
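The single-reservoir model referred to above is detailed in the next section. As background, the balance (transition) equation for a single reservoir commonly takes a mass-balance form; the sketch below shows one such assumed form (bounds, spill handling, and variable names are illustrative, not necessarily the paper's exact formulation).

```python
def reservoir_transition(storage, release, inflow, s_min=0.0, s_max=100.0):
    """Assumed mass-balance transition for a single reservoir:
    next storage = storage + inflow - release, clipped to the
    physical bounds [s_min, s_max]; any excess above s_max spills."""
    s_next = storage + inflow - release
    spill = max(0.0, s_next - s_max)
    s_next = min(max(s_next, s_min), s_max)
    return s_next, spill
```

Because this transition is known in closed form, the next state for any action (including an opposite action) can be computed directly once the stochastic inflow is observed, which is what makes the opposition-based update applicable here.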