IEICE TRANS. INF. & SYST., VOL.E88–D, NO.1 JANUARY 2005
127

PAPER

Maintaining System State Information in a Multiagent Environment for Effective Learning

Gang CHEN†a), Student Member, Zhonghua YANG†b), Hao HE††, and Kiah-Mok GOH††, Nonmembers

SUMMARY  One fundamental issue in multiagent reinforcement learning is how to deal with the limited local knowledge of an agent in order to achieve effective learning. In this paper, we argue that this issue can be solved more effectively if agents are equipped with a consistent global view. We achieve this by requiring agents to follow an interaction protocol. The properties of the protocol are derived and theoretically analyzed. A distributed protocol that satisfies these properties is presented. Experimental evaluations are conducted for a well-known test case (the pursuit game) in the context of two learning algorithms. The results show that the protocol is effective and that the reinforcement learning algorithms using it perform much better.

key words: multiagent system, system state, distributed protocol, token ring

1. Introduction

The agent reinforcement learning problem was originally studied in a single-agent setting and has been extended to multi-agent systems [3], [10]. The problem is typically modeled as a Markov Decision Process (MDP). The overall goal of an agent is to learn a policy in order to maximize long-term performance. A policy defines a mapping from system states to actions, and performance is measured in terms of the accumulation of discounted rewards. MDPs are extended to multi-agent MDPs (MAMDPs) to allow concurrent learning by multiple agents [8]. A MAMDP of m agents is defined as a tuple of the form M = (S, A^m, δ, γ^m), where S is a set of potential system states, A^m is a set of possible joint actions executable by the m agents, δ is the system transition function, and γ^m is a set of reward functions for the m agents.
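To make the MAMDP tuple M = (S, A^m, δ, γ^m) concrete, the following is a minimal sketch of it as a data structure, together with a toy two-agent instance. All names here (`MAMDP`, `delta`, the toy state encoding) are illustrative assumptions, not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = Tuple[int, ...]        # an opaque encoding of a system state
JointAction = Tuple[int, ...]  # one action component per agent (m entries)

@dataclass(frozen=True)
class MAMDP:
    """A multi-agent MDP M = (S, A^m, delta, gamma^m) for m agents."""
    m: int                                        # number of agents
    states: FrozenSet[State]                      # S
    joint_actions: FrozenSet[JointAction]         # A^m
    delta: Callable[[State, JointAction], State]  # transition: S x A^m -> S
    rewards: Tuple[Callable[[State, JointAction], float], ...]  # gamma^m, one per agent

# Toy instance: two agents on a line; the state holds both positions,
# and each joint action moves each agent by its own displacement.
def step(s: State, a: JointAction) -> State:
    return tuple(p + d for p, d in zip(s, a))

mdp = MAMDP(
    m=2,
    states=frozenset({(0, 0), (1, 0), (0, 1), (1, 1)}),
    joint_actions=frozenset({(0, 0), (1, 0), (0, 1), (1, 1)}),
    delta=step,
    rewards=(lambda s, a: float(s[0]), lambda s, a: float(s[1])),
)
print(mdp.delta((0, 0), (1, 1)))  # -> (1, 1)
```

Note that δ takes a *joint* action: the next state depends on what every agent does, which is exactly why a single agent's local knowledge is insufficient.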
A system state is considered as a signal from the environment [10] in which the agents reside and with which they interact by performing actions. Obviously, many kinds of information come from the environment; however, we consider as a system state only that information which satisfies the Markov property. The Markov property states that the next state of a system is determined solely by the current state of the system and the actions taken by agents in that state, that is, δ : S × A^m → S.

Manuscript received September 16, 2003.
Manuscript revised June 16, 2004.
† The authors are with the Information Communication Institute of Singapore, School of Electrical and Electronic Engineering, Nanyang Technological University, 639798 Singapore.
†† The authors are with the Singapore Institute of Manufacturing Technology, 638075 Singapore.
a) E-mail: pg02463664@ntu.edu.sg
b) E-mail: eZhYang@ntu.edu.sg

Since the behavior of a MAS is determined by the whole group of autonomous agents, who act independently and concurrently, the current system state and an agent's discounted reward depend not only on the agent's own policy but are also affected by the policies of other agents. Clearly, in a MAS setting, an agent may observe only a partial system state, and thus the introduction of the MAMDP (and the Markov property) poses a severe challenge: how does an agent obtain and maintain the current system state? The case where an agent observes only a partial system state is modeled as a Partially Observable Markov Decision Process (POMDP) [4]. A POMDP problem is hard to solve even with a single learning agent. In a multiagent setting, it has often been approached under stringent assumptions [8], but no general theoretical results have been reported in this regard. In this paper, we present our approach to obtaining a global view that is consistent with the most up-to-date system state.
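The gap between a full system state and what one agent actually sees can be illustrated with a small sketch. The grid-world setting and all names (`full_state`, `observe`, the visibility radius) are illustrative assumptions chosen to echo the pursuit-game test case, not definitions from the paper.

```python
from typing import Dict, Tuple

# Full system state: the position of every entity (the global view).
full_state: Dict[str, Tuple[int, int]] = {
    "agent0": (0, 0),
    "agent1": (3, 2),
    "prey":   (1, 1),
}

def observe(state: Dict[str, Tuple[int, int]], who: str, radius: int = 1):
    """Partial observation: an agent sees only entities within `radius`
    (Chebyshev distance) of its own position."""
    x, y = state[who]
    return {
        name: pos
        for name, pos in state.items()
        if max(abs(pos[0] - x), abs(pos[1] - y)) <= radius
    }

# agent0 sees itself and the prey, but not the distant agent1, so any
# policy it learns from this observation alone violates the Markov
# property with respect to the full state.
print(observe(full_state, "agent0"))
```

A protocol that shares state among agents, as proposed in this paper, is one way to close exactly this gap.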
Our approach attempts to ensure the Markov property of a MAS. A token-ring-based distributed protocol is presented that satisfies three properties. These properties are theoretically analyzed to guarantee a global system view that respects the Markov property. It has been shown that algorithms that assume fully observable system states are both easily understandable and theoretically guaranteed to be effective [5], [6]. While devised for providing an effective multi-agent learning environment, our protocol is expected to find application in other contexts where the most up-to-date system-level information of a distributed system is required.

2. System Model

In a multiagent system, an agent fulfills its missions by executing actions. When an agent starts or finishes the execution of an action, we say that a corresponding event e is fired. An event e that occurs at time t is denoted by a tuple (e, t). We assume that there exists a synchronized global clock with adequate resolution for timestamping events. The history at time t, denoted h_t, is defined as {(e, t′) | t′ ≤ t}. The firing of an event is considered the result of decisions made by an agent. We require that agents make their decisions based on information with the desired Markov property. In other words, (R1) whenever an event fires, the corresponding agent must have the most up-to-date information of the system state. A multiagent system π at time t is defined as a tuple:

Copyright © 2005 The Institute of Electronics, Information and Communication Engineers
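The event and history definitions above can be sketched directly: events are timestamped on firing, and h_t is recovered by filtering on the timestamp. The shared counter below stands in for the synchronized global clock the text assumes; all names (`EventLog`, `fire`, `history`) are illustrative, not the paper's notation.

```python
import itertools
from dataclasses import dataclass, field
from typing import List, Tuple

# A shared monotone counter stands in for the synchronized global
# clock with adequate resolution assumed in the text.
_clock = itertools.count()

@dataclass
class EventLog:
    """Timestamped events; history(t) returns h_t = {(e, t') | t' <= t}."""
    events: List[Tuple[str, int]] = field(default_factory=list)

    def fire(self, e: str) -> int:
        """Record event e at the current global time; return its timestamp."""
        t = next(_clock)
        self.events.append((e, t))
        return t

    def history(self, t: int) -> List[Tuple[str, int]]:
        """All events with timestamp t' <= t."""
        return [(e, t2) for (e, t2) in self.events if t2 <= t]

log = EventLog()
log.fire("agent0:start(move)")
t1 = log.fire("agent0:finish(move)")
log.fire("agent1:start(move)")
print(log.history(t1))  # only the events fired up to time t1
```

Requirement R1 then says that at the moment `fire` is called, the firing agent's view of the system state must already reflect every earlier event in this history.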