Intuitive Action Set Formation in Learning Classifier Systems with Memory Registers

L. Simões and M.C. Schut and E. Haasdijk 1

Abstract. An important design goal in Learning Classifier Systems (LCS) is to equally reinforce those classifiers which cause the same level of reward to be supplied by the environment. In this paper, we propose a new method for action set formation in LCS. When applied to a Zeroth Level Classifier System with Memory registers (ZCSM), our method allows the distribution of rewards among classifiers which result in the same memory state, rather than only those encoding the same memory update action.

1 INTRODUCTION

This paper introduces a new method for action set formation (asf) in Learning Classifier Systems, and tests it in partially observable environments requiring memory. The asf operation is responsible for choosing the classifiers that will receive the reward supplied by the environment for some performed action. When new classifiers are generated, the system has no way of knowing how good they are. Their strengths depend on the actions they take in the contexts under which they trigger, and on the other classifiers in the population with which they interact. Classifiers added to the population are assigned an initial strength value; then, through repeated usage, the strength update component gradually converges towards a better estimate of their quality. But since the system has to perform at the same time as it is building its rule base, it is forced to act despite its uncertainty about the environment, selecting from an ever-changing population of insufficiently tested classifiers. The method introduced here, iasf, eliminates some of the noise to which the quality estimation component is subjected, with the goal of improving system performance.

2 BACKGROUND

In the mid-1990s, Wilson [7] proposed ZCS as a simplification of Holland's original LCS [3].
Most importantly, he left out the message list which acted as memory in the original system. Thus, Wilson's models had no way of remembering previously encountered states and could not perform optimally in partially observable environments, where an agent can find itself in a state that is indistinguishable from another state, even though the best action to undertake is not necessarily the same in both. Wilson proposed [7] a solution for this problem in the form of memory registers that extend the classifiers. Cliff & Ross [2] follow this suggestion and implement ZCSM, extending ZCS with a memory mechanism. In their experiments they observed that ZCSM can efficiently exploit memory in partially observable environments.

1 Department of Computer Science, Faculty of Sciences, VU University, Amsterdam, The Netherlands, email: {lfms, mc.schut, e.haasdijk}@few.vu.nl

Stone & Bull extensively compared ZCS to the more popular XCS in noisy, continuous-valued environments [6] and found that what makes XCS so good in deterministic environments (namely, its attempt to build a complete, maximally accurate and maximally general map of the payoff landscape) becomes a disadvantage as the level of noise in the environment increases. ZCS's partial map, focusing on high-rewarding niches in the payoff landscape, then becomes an advantage. This suggests ZCS as an adaptive control mechanism in multi-step, partially observable, stochastic real-world problems.
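The ZCSM rule structure described above (a ternary condition over sensory and memory bits, plus an external action and a memory-update string) can be pictured with a minimal sketch. The field names, encodings and initial strength value below are illustrative assumptions, not Cliff & Ross's exact representation:

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str      # ternary string over sensory bits, e.g. "1#0"
    mem_condition: str  # ternary string over memory register bits, e.g. "0#"
    action: str         # external action
    mem_action: str     # memory update: '0'/'1' overwrite a register, '#' leaves it
    strength: float = 20.0  # arbitrary placeholder for the initial strength

def matches(pattern: str, state: str) -> bool:
    # '#' is a wildcard matching either bit value
    return all(p in ('#', s) for p, s in zip(pattern, state))

# A rule enters the match set M only if both its sensory and memory
# conditions match the current input and memory state.
rule = Classifier("1#0", "0#", "go", "0#")
in_match_set = matches(rule.condition, "110") and matches(rule.mem_condition, "01")
```

Under this encoding, the memory condition is matched exactly like the sensory condition, so memory simply widens the rule's context.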
With memory added as described in [2], rules prescribe an external action as well as a modification of the memory bits. It can be argued that the core of ZCS lies in the next stage, reinforcement, as it is responsible for incrementally learning the quality of the rules in the population, which will in turn determine the system's behaviour. The action set A includes those rules in M that advocated the same action as the chosen classifier. The rules in this action set share in the reward that results from the selected action (with the rationale that choosing any of those rules would have had the same effect). Rules in M that advocate a different action are penalised.

Traditionally, A consists of those rules in M that match on a bitwise comparison with the action part of the chosen classifier. Now, consider ZCSM, where operators on the memory state are added to the action part of the rules. Suppose, then, a situation where the memory state was 01, and remains the same after execution of some chosen classifier c, which advocated [0#] 2. Traditional action set formation would then have A include only those classifiers from M advocating this same memory operation ("set the first memory register to 0") as well as the same external action as the chosen classifier. However, all of the internal actions {##, #1, 01} would result in exactly the same internal state. Not only would the system not reward any classifier in M having one of those internal actions (and the same external action as the chosen classifier), it would actually penalise them. This seems to conflict with ZCS's goal of equally rewarding those classifiers which would cause the same level of reward to be supplied by the environment.

2 Disregarding the external output for simplicity.

ECAI 2008, M. Ghallab et al. (Eds.), IOS Press, 2008. © 2008 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-58603-891-5-761
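The mismatch described in Section 3 can be made concrete with a short sketch of the proposed grouping: instead of comparing memory-action strings bitwise, classifiers are grouped by the memory state their action would produce. Function and variable names here are illustrative assumptions, not the paper's notation:

```python
def apply_mem_action(mem_action: str, mem_state: str) -> str:
    # '#' leaves a register bit unchanged; '0'/'1' overwrite it
    return ''.join(s if m == '#' else m for m, s in zip(mem_action, mem_state))

def intuitive_action_set(match_set, chosen_ext, chosen_mem, mem_state):
    # match_set holds (external_action, mem_action) pairs; a rule joins A
    # when it advocates the same external action AND its memory action
    # leads to the same next memory state as the chosen classifier's.
    target = apply_mem_action(chosen_mem, mem_state)
    return [(ext, mem) for ext, mem in match_set
            if ext == chosen_ext
            and apply_mem_action(mem, mem_state) == target]

# The paper's example: memory state 01, chosen classifier advocates [0#].
M = [("go", "0#"), ("go", "##"), ("go", "#1"), ("go", "01"),
     ("go", "1#"), ("stop", "##")]
A = intuitive_action_set(M, "go", "0#", "01")
# [0#], ##, #1 and 01 all leave the memory at 01, so those four rules
# share the reward; a bitwise comparison would have kept only [0#].
```

Under this grouping, the rule advocating [1#] is still excluded (it moves the memory to 11), as is the rule with a different external action, so only genuinely equivalent classifiers share the payoff.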