IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 13, NO. 3, JUNE 2005 289
Combination of Online Clustering and Q-Value Based
GA for Reinforcement Fuzzy System Design
Chia-Feng Juang, Member, IEEE
Abstract—This paper proposes a combined online clustering
and Q-value-based genetic algorithm (GA) learning scheme for
fuzzy system design (CQGAF) with reinforcements. The CQGAF
accomplishes GA-based fuzzy system design in a reinforcement
learning environment where only weak reinforcement signals,
such as “success” and “failure,” are available. In CQGAF, there
are no fuzzy rules initially; they are generated automatically.
The precondition part of a fuzzy system is constructed online
by an aligned clustering-based approach, which achieves a
flexible partition. Then, the consequent part is designed by
Q-value based genetic reinforcement learning. Each individual
in the GA population encodes the consequent part parameters
of a fuzzy system and is associated with a Q-value. The Q-value
estimates the discounted cumulative reinforcement obtained
by the individual and is used as the fitness value for GA
evolution. At each time step, an individual is selected according to
the Q-values, and then a corresponding fuzzy system is built and
applied to the environment, and a critic is received. With this critic,
Q-learning with an eligibility trace is executed. After each trial, GA
is performed to search for better consequent parameters based on
the learned Q-values. Thus, in CQGAF, evolution is performed
immediately after each trial, in contrast to a general GA, in which
many trials are performed before each evolution step. The feasibility
of CQGAF is demonstrated through simulations in cart-pole bal-
ancing, magnetic levitation, and chaotic system control problems
with only binary reinforcement signals.
Index Terms—Flexible partition, fuzzy control, Q-learning, rein-
forcement learning, temporal difference.
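The trial loop summarized in the abstract (select an individual by its Q-value, apply the corresponding fuzzy system, update the Q-value from the received critic, then evolve) can be sketched at a high level as follows. This is an illustrative reconstruction under simplifying assumptions, not the paper's implementation: the greedy selection, the single-step TD-style update, and every name and parameter here are hypothetical.

```python
def run_trial(population, q_values, env_step, env_reset, alpha=0.1, gamma=0.95):
    """One trial of a Q-value-guided selection loop (illustrative sketch).

    population: list of consequent-parameter vectors (GA individuals).
    q_values:   one Q-value per individual; used for selection during the
                trial and as the GA fitness afterward.
    env_step:   callable(individual, state) -> (next_state, reward, done).
    env_reset:  callable() -> initial state.
    """
    state = env_reset()
    done = False
    while not done:
        # Greedily select the individual with the largest current Q-value
        i = max(range(len(population)), key=lambda j: q_values[j])
        state_next, reward, done = env_step(population[i], state)
        # TD-style update of the selected individual's Q-value
        target = reward + (0.0 if done else gamma * max(q_values))
        q_values[i] += alpha * (target - q_values[i])
        state = state_next
    # After the trial, a GA would evolve `population` using q_values as fitness
    return q_values
```

In the paper itself the update additionally uses an eligibility trace and GA evolution is performed immediately after each trial; the sketch only shows how Q-values double as both a selection criterion and a fitness measure.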
I. INTRODUCTION
THE application of fuzzy systems in different areas has been
an active research topic in recent years. In general, the per-
formance of a fuzzy system depends critically on its fuzzy
rules and input–output membership functions. However, it is
not always easy, or even possible, to extract this information
from experts’ prior knowledge. To cope with this difficulty,
many researchers have worked to find automatic methods
for fuzzy system design. These automatic methods
may be divided into two categories, viz., design based on su-
pervised learning and that based on reinforcement learning.
For many problems, the training data are usually difficult and
expensive to obtain and may not be readily available. For these
problems, reinforcement learning is more suitable than super-
vised learning. In reinforcement learning, an agent receives
from its environment a critic, called reinforcement, which can
Manuscript received September 11, 2002; revised August 29, 2003 and
September 7, 2004. This work was supported by the National Science Council,
Taiwan, R.O.C., under Grant NSC 92-2213-E-005-001.
The author is with the Department of Electrical Engineering, National Chung
Hsing University, Taichung, 402 Taiwan, R.O.C.
Digital Object Identifier 10.1109/TFUZZ.2004.841726
be thought of as a reward or a punishment. The agent is not
told which actions to take, but instead must discover which
actions yield the most reward by trying them. To solve the
reinforcement learning problem, the most popular approach is
the temporal difference (TD) method [1]–[3]. Two TD-based
reinforcement learning approaches have been proposed: the
adaptive heuristic critic (AHC) [4], [5] and Q-learning [6], [7].
In AHC, there are two separate networks: an action network and
an evaluation network. Based on the AHC, in [4], the authors
used two neuron-like elements to solve the cart-pole balancing
problem. In this approach, the input space is discretized into
nonoverlapping regions, or “boxes,” and there is no general-
ization between boxes. In contrast to the single-layer
AHC in [4], a two-layer AHC is used in [5]. In the two-layer
AHC, discretization of the input space is not necessary, since
the additional hidden layer allows the network to represent
any nonlinear discriminative function. In [8], a generalized
approximate reasoning-based intelligent control (GARIC) is
proposed, in which a two-layer feedforward neural network is
used as an action evaluation network and a fuzzy inference
network is used as an action selection network. The GARIC
provides generalization ability in the input space and extends
the AHC algorithm to include the prior control knowledge of
human operators. In [9], a reinforcement neural-network-based
fuzzy logic control system (RNN-FLCS) is proposed, where
both the action and evaluation networks are constructed by
neural fuzzy networks. One drawback of these actor-critic ar-
chitectures is that they usually suffer from the local minimum
problem in network learning due to the use of the gradient
descent learning method.
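As background for the Q-learning-based schemes surveyed next, the tabular Q-learning update with an (accumulating) eligibility trace can be sketched as follows. This is a generic illustration, not the CQGAF algorithm: the two-action space, the dictionary representation, and all parameter values are assumptions for the sketch.

```python
def q_lambda_update(Q, E, s, a, r, s_next, alpha=0.1, gamma=0.95, lam=0.8):
    """One TD update of tabular Q-learning with an accumulating trace.

    Q, E: dicts mapping (state, action) -> Q-value / trace.
    A two-action space {0, 1} is assumed for illustration.
    """
    # TD error: reward plus discounted best next-state value,
    # minus the current estimate for the visited pair
    best_next = max(Q.get((s_next, b), 0.0) for b in (0, 1))
    delta = r + gamma * best_next - Q.get((s, a), 0.0)
    # Mark the visited state-action pair as eligible
    E[(s, a)] = E.get((s, a), 0.0) + 1.0
    # Propagate the TD error to all eligible pairs, decaying the traces
    for key in list(E):
        Q[key] = Q.get(key, 0.0) + alpha * delta * E[key]
        E[key] *= gamma * lam
    return Q, E
```

The trace lets a single critic signal update not only the most recent state-action pair but all recently visited ones, which is the mechanism the abstract refers to as "Q-learning with eligibility trace."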
Besides the aforementioned AHC-based learning architectures,
increasing effort is being devoted to
learning schemes based on Q-learning. Q-learning collapses
the two measures used by actor/critic algorithms in AHC into
one measure, referred to as the Q-value. It may be considered
a compact version of the AHC and is simpler to
implement. In some case studies [10], [11], the superi-
ority of Q-learning over AHC has been demonstrated. Some
Q-learning based reinforcement learning structures have also
been proposed [12]–[17]. In [12], a dynamic fuzzy Q-learning
is proposed for fuzzy inference system design. In this method,
the consequent parts of fuzzy rules are randomly generated
and the best rule set is selected based on its corresponding
Q-value. The problem with this approach is that if the optimal
solution is not present in the randomly generated set, then the
performance may be poor. In [14] and [15], fuzzy Q-learning
is applied to select the consequent action values of a fuzzy
inference system. For these methods, the consequent value is