IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 13, NO. 3, JUNE 2005

Combination of Online Clustering and Q-Value Based GA for Reinforcement Fuzzy System Design

Chia-Feng Juang, Member, IEEE

Abstract—This paper proposes a combination of online clustering and Q-value based genetic algorithm (GA) learning scheme for fuzzy system design (CQGAF) with reinforcements. The CQGAF fulfills GA-based fuzzy system design in a reinforcement learning environment where only weak reinforcement signals, such as "success" and "failure," are available. In CQGAF, there are no fuzzy rules initially; they are generated automatically. The precondition part of a fuzzy system is constructed online by an aligned clustering-based approach, which yields a flexible input-space partition. The consequent part is then designed by Q-value based genetic reinforcement learning. Each individual in the GA population encodes the consequent-part parameters of a fuzzy system and is associated with a Q-value, which estimates the discounted cumulative reinforcement earned by that individual and serves as its fitness value for GA evolution. At each time step, an individual is selected according to the Q-values, the corresponding fuzzy system is built and applied to the environment, and a critic is received. With this critic, Q-learning with eligibility traces is executed. After each trial, the GA searches for better consequent parameters based on the learned Q-values. Thus, in CQGAF, evolution is performed immediately after the end of each trial, in contrast to a general GA, where many trials are performed before evolution. The feasibility of CQGAF is demonstrated through simulations of cart-pole balancing, magnetic levitation, and chaotic system control problems with only binary reinforcement signals.

Index Terms—Flexible partition, fuzzy control, Q-learning, reinforcement learning, temporal difference.
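The learning loop summarized in the abstract can be sketched in heavily simplified form. The scalar toy task, population size, epsilon value, and GA operators below are illustrative placeholders, not the paper's actual CQGAF implementation, which evolves the consequent parameters of fuzzy rules built on clustered preconditions.

```python
import random

# Hedged sketch of the Q-value based GA idea from the abstract: each individual
# in a population carries a Q-value that is updated from a binary critic and
# then reused as its GA fitness. All task and parameter choices are assumptions.

random.seed(1)
TARGET = 0.7            # hidden optimum; the critic is binary, as in the paper
ALPHA = 0.3             # Q-value step size (assumed)

def reinforcement(action):
    """Weak binary critic: 'success' (1) near the target, 'failure' (0) otherwise."""
    return 1.0 if abs(action - TARGET) < 0.1 else 0.0

# Population of candidate scalar parameters, each paired with a learned Q-value.
population = [{"param": random.uniform(0, 1), "q": 0.0} for _ in range(10)]

for trial in range(40):
    # Within a trial, repeatedly select an individual by its Q-value
    # (epsilon-greedy), apply it, and update its Q-value from the critic.
    for _ in range(20):
        ind = (random.choice(population) if random.random() < 0.2
               else max(population, key=lambda i: i["q"]))
        r = reinforcement(ind["param"])
        ind["q"] += ALPHA * (r - ind["q"])
    # After the trial, evolve: keep the better half by Q-value and refill with
    # mutated copies that inherit the parent's learned Q-value.
    population.sort(key=lambda i: i["q"], reverse=True)
    survivors = population[:5]
    children = [{"param": min(1.0, max(0.0, p["param"] + random.gauss(0, 0.1))),
                 "q": p["q"]} for p in survivors]
    population = survivors + children

best = max(population, key=lambda i: i["q"])
print(best["param"], best["q"])
```

The key design point mirrored here is that evolution runs once per trial on Q-values learned during that trial, rather than waiting many trials to accumulate fitness statistics.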
Manuscript received September 11, 2002; revised August 29, 2003 and September 7, 2004. This work was supported by the National Science Council, Taiwan, R.O.C., under Grant NSC 92-2213-E-005-001. The author is with the Department of Electrical Engineering, National Chung Hsing University, Taichung 402, Taiwan, R.O.C. Digital Object Identifier 10.1109/TFUZZ.2004.841726

I. INTRODUCTION

The application of fuzzy systems in different areas has been an active research topic in recent years. In general, the performance of a fuzzy system depends critically on its fuzzy rules and input–output membership functions. However, it is not always easy, or even possible, to extract this information from experts' prior knowledge. To cope with this difficulty, many researchers have worked on automatic methods for fuzzy system design. These methods may be divided into two categories: design based on supervised learning and design based on reinforcement learning. For many problems, training data are difficult and expensive to obtain and may not be readily available; for such problems, reinforcement learning is more suitable than supervised learning. In reinforcement learning, an agent receives from its environment a critic, called reinforcement, which can be thought of as a reward or a punishment. The agent is not told which actions to take, but instead must discover which actions yield the most reward by trying them. The most popular approach to the reinforcement learning problem is the temporal difference (TD) method [1]–[3]. Two TD-based reinforcement learning approaches have been proposed: the adaptive heuristic critic (AHC) [4], [5] and Q-learning [6], [7]. In AHC, there are two separate networks: an action network and an evaluation network. Based on the AHC, the authors of [4] used two neuron-like elements to solve the cart-pole balancing problem.
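As a concrete reference point for the TD method cited above, the core TD(0) value update can be sketched as follows; the toy chain task, step size, and discount factor are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a tabular TD(0) value update, the basic mechanism behind
# the TD methods referenced in the text. Task and parameters are assumed.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One TD(0) backup: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error

# Toy three-state chain 0 -> 1 -> 2; entering terminal state 2 pays reward 1.
V = [0.0, 0.0, 0.0]
for _ in range(200):
    s = 0
    while s != 2:
        s_next = s + 1                      # deterministic right-moving chain
        r = 1.0 if s_next == 2 else 0.0
        td0_update(V, s, r, s_next)
        s = s_next

print(V)  # V[1] approaches 1.0; V[0] approaches gamma * V[1], about 0.95
```

Notice that the update uses only the observed reward and the current estimate of the next state, so learning proceeds online, step by step, without waiting for the final outcome of a trial.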
In the approach of [4], the input space is discretized into nonoverlapping regions, or "boxes," and there is no generalization between boxes. In contrast to the single-layer AHC in [4], a two-layer AHC is used in [5]; there, discretization of the input space is unnecessary, since the additional hidden layer allows the network to represent any nonlinear discriminative function. In [8], a generalized approximate reasoning-based intelligent control (GARIC) architecture is proposed, in which a two-layer feedforward neural network serves as the action evaluation network and a fuzzy inference network serves as the action selection network. GARIC provides generalization ability in the input space and extends the AHC algorithm to incorporate the prior control knowledge of human operators. In [9], a reinforcement neural-network-based fuzzy logic control system (RNN-FLCS) is proposed, in which both the action and evaluation networks are constructed as neural fuzzy networks. One drawback of these actor-critic architectures is that they usually suffer from the local-minimum problem in network learning, owing to the use of gradient descent. Besides the aforementioned AHC-based learning architectures, increasing research attention is being devoted to learning schemes based on Q-learning. Q-learning collapses the two measures used by the actor/critic algorithms of AHC into a single measure, referred to as the Q-value. It may be considered a compact version of the AHC and is simpler to implement. In some case studies [10], [11], the superiority of Q-learning over AHC has been demonstrated. Several Q-learning based reinforcement learning structures have also been proposed [12]–[17]. In [12], a dynamic fuzzy Q-learning method is proposed for fuzzy inference system design. In this method, the consequent parts of fuzzy rules are randomly generated, and the best rule set is selected based on its corresponding Q-value.
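The collapse of the separate actor and critic into a single Q-value can be illustrated with a minimal tabular Q-learning sketch; the corridor task, epsilon-greedy exploration, and step sizes below are illustrative assumptions, not taken from any of the cited methods.

```python
import random

# Hedged sketch of tabular Q-learning: one table of Q-values serves both for
# action selection (the actor's role) and for evaluation (the critic's role).

def q_learning_step(Q, s, a, r, s_next, alpha=0.2, gamma=0.9):
    """Standard backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

# One-dimensional corridor of 5 cells; actions: 0 = left, 1 = right; goal = cell 4.
random.seed(0)
Q = [[0.0, 0.0] for _ in range(5)]
for _ in range(300):
    s = 0
    for _ in range(1000):                     # step cap keeps early episodes short
        if s == 4:
            break
        if random.random() < 0.3:             # explore
            a = random.randrange(2)
        else:                                 # exploit the learned Q-values
            a = max((0, 1), key=lambda x: Q[s][x])
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0       # binary reinforcement at the goal
        q_learning_step(Q, s, a, r, s_next)
        s = s_next

# The greedy policy derived from Q alone should point right in every non-goal cell.
print([max((0, 1), key=lambda x: Q[s][x]) for s in range(4)])
```

Because the same Q-table both selects and evaluates actions, no separate evaluation network is needed, which is the simplification the text attributes to Q-learning over AHC.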
The problem with this approach is that if the optimal solution is not present in the randomly generated set, performance may be poor. In [14] and [15], fuzzy Q-learning is applied to select the consequent action values of a fuzzy inference system. For these methods, the consequent value is

1063-6706/$20.00 © 2005 IEEE