Proceedings of the 2007 IEEE Symposium on Approximate
Dynamic Programming and Reinforcement Learning (ADPRL 2007)
Opposition-Based Reinforcement Learning
in the Management of Water Resources
M. Mahootchi, H. R. Tizhoosh, K. Ponnambalam
Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario,
N2L 3G1, Canada, mmahootc@engmail.uwaterloo.ca, tizhoosh@uwaterloo.ca, ponnu@uwaterloo.ca
Abstract- Opposition-Based Learning (OBL) is a new
scheme in machine intelligence. In this paper, an OBL
version of Q-Learning, which exploits opposite quantities
to accelerate learning, is used for the management of
single-reservoir operations. In this method, an agent takes
an action, receives a reward, and updates its knowledge in
terms of action-value functions. Furthermore, the transition
function, which is the balance equation in the optimization
model, determines the next state and updates the action-value
function pertinent to the opposite action. Two types of
opposite actions will be defined. It will be demonstrated
that using OBL can significantly improve the efficiency
of the operating policy within limited iterations. It is also
shown that this technique is more robust than Q-Learning.
Index Terms- water reservoirs, Q-learning, opposite
action, reinforcement learning.
I. INTRODUCTION
FINDING efficient operating policies in multi-reservoir
applications has been a challenging research area in the
past decades. Many attempts using traditional methods,
including linear and non-linear optimization techniques,
have been made to overcome the curse of dimensionality
in real-world applications. However, most of these efforts
have involved various simplifications and approximations,
which usually make the operating policies inefficient in
practice. Using optimization techniques along with simulation,
such as Reinforcement Learning (RL) techniques,
could be a suitable alternative for this purpose. RL
is a powerful and well-known technique in machine
learning research that copes well with many optimization
and simulation problems. It is also called Simulation-Based
Dynamic Programming [1], in which a decision
maker (agent) optimizes an objective function through
interaction with deterministic or stochastic environments.
These interactions may yield instant rewards or
punishments, which are accumulated during the training
process and are called action-value functions. These values
are the basis for the agent to take proper actions in
different situations (states). Based on what has been
proven by Watkins [2], these values converge to steady
states if each state-action pair is visited an infinite
number of times (in practice, multiple times); however,
this may take too much time in real-world applications.
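The tabular Q-Learning scheme described above can be sketched as follows. This is a minimal illustration only: the state and action sets, learning parameters, and the epsilon-greedy policy are placeholder assumptions, not the reservoir formulation developed later in this paper.

```python
import random

# Minimal tabular Q-learning sketch (Watkins). States, actions,
# and parameters below are illustrative placeholders.
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1
states, actions = range(5), range(3)
Q = {(s, a): 0.0 for s in states for a in actions}

def choose_action(s):
    # epsilon-greedy selection over the action-value table
    if random.random() < EPSILON:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q[(s, a)])

def update(s, a, reward, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * (reward + GAMMA * best_next - Q[(s, a)])
```

Repeated application of `update` along simulated trajectories accumulates the action-value estimates that the agent's policy is read from.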
Therefore, the question arises of how to achieve
an optimal solution with fewer interactions. The Opposition-Based
Learning (OBL) scheme, first introduced
by Tizhoosh [3], could be a suitable answer to
this question. Tizhoosh has shown that using
this scheme in soft computing methods such as
Genetic Algorithms (GA), Neural Networks (NN), and
Reinforcement Learning (RL) can generally speed up the
training process; however, this is completely problem-dependent.
He also used this scheme with RL to find a path to a fixed goal
in discrete grid worlds of different sizes [3]. In this
specific example, an agent takes an action in the current
state and updates the respective action-value function
in addition to those functions related to
opposite actions or states. The criterion for granting
a reward or punishment to the agent is its distance to a
fixed goal inside the grid. Moreover, the environment
in this case study is totally deterministic.
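A minimal sketch of such an opposite-action update might look as follows. The mirrored-index definition of `opposite` and the `model_step` transition are hypothetical stand-ins: this paper later defines two types of opposite actions and derives the opposite transition from the reservoir balance equation rather than from the toy model used here.

```python
# Sketch of one opposition-based Q-learning step: a single
# interaction updates the action-value of both the taken action
# and its opposite. Parameters and the transition model are
# illustrative assumptions.
ALPHA, GAMMA = 0.1, 0.95
n_states, n_actions = 5, 4
Q = {(s, a): 0.0 for s in range(n_states) for a in range(n_actions)}

def opposite(a):
    # hypothetical opposite: mirror the index in the action set
    return (n_actions - 1) - a

def model_step(s, a):
    # placeholder transition/reward model standing in for the
    # balance equation; returns (next_state, reward)
    s_next = (s + a) % n_states
    return s_next, float(-abs(a - 2))

def obl_update(s, a):
    # apply the usual Q-learning update for the chosen action
    # and, using the model, for its opposite as well
    for act in (a, opposite(a)):
        s_next, r = model_step(s, act)
        best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
        Q[(s, act)] += ALPHA * (r + GAMMA * best_next - Q[(s, act)])
```

Because each interaction now informs two entries of the action-value table, fewer environment interactions are needed to cover the state-action space, which is the intuition behind the speed-up reported for OBL.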
In this paper, we investigate the effect of the Opposition-Based
Learning (OBL) scheme using the Q-Learning method
for reservoir management. To show that this scheme
is efficient, it is applied to a single-reservoir problem
which is completely stochastic in terms of inflow to the
reservoir. We can therefore find the optimal
or near-optimal policies and performances with regular
Q-Learning and simulation in a reasonable time. The
corresponding results will be extended to multi-reservoir
applications in future research.
The paper is organized as follows: In the next section,
a simple model of a single reservoir will be explained.
Section 3 will provide a general review of Stochastic
Dynamic Programming (SDP). In Section 4, Q-Learning
will be briefly reviewed. Some basic concepts of OBL
and a version of the opposition-based algorithm using Q-Learning
for reservoir management will be described
in Sections 5 and 6. Finally, in Sections 7 and 8,
experimental results and conclusions will be provided.
1-4244-0706-0/07/$20.00 ©2007 IEEE 217