Reinforcement Learning of Communication in a Multi-Agent Context

Shirley Hoet, LIP6, Pierre et Marie Curie University, Paris, France, shirley.hoet@lip6.fr
Nicolas Sabouret, LIP6, Pierre et Marie Curie University, Paris, France, nicolas.sabouret@lip6.fr

Abstract—In this paper, we present a reinforcement learning approach to multi-agent communication that learns what to communicate, when, and to whom. The method relies on introspective agents that can reason about their own actions and data so as to construct appropriate communicative acts. We propose an extension of classical reinforcement learning algorithms for multi-agent communication, and we show how communicative acts and memory help solve non-Markovity and asynchronism issues in MAS.

Keywords—Communication Learning; Reinforcement Learning; Multi-Agent System

I. INTRODUCTION

In direct communication in MAS, it is usually assumed that agents know when to send a message, the type of communicative act they must use, the content of the message, and its recipient(s). However, these hypotheses no longer hold in open and heterogeneous MAS: agents must learn what to communicate, to whom, and when.

In this paper, we focus on a single learner agent that learns request and query messages. Much has been done in the field of single-agent behaviour learning, in particular using reinforcement learning. However, existing techniques face several limits when it comes to learning to communicate in MAS.

First, since the MAS is open and loosely coupled, the learner agent has no knowledge of the preconditions and effects of the other agents' abilities. Yet, to delegate a task to another agent (and thus to build relevant request messages), it must determine the context and possible actions of each agent in the system.

Second, the environment is only partially observable by the learner agent. There exist techniques to learn good behaviours in a partially observable environment [1], [2].
However, they require that the agent knows a model of the environment, such as its state space and/or the transition probability from one state to another given the action performed by the agent. An agent that evolves in an open and loosely coupled MAS has no access to this information. Another approach consists in separating two hidden states of the environment either with the agent's memory [3], [4] or with information obtained by communication [5], [6]. In our approach, we use communication to discover hidden states and store them in the agent's memory.

Third, when agents interact to delegate tasks (through request acts) in an asynchronous MAS, a requested action can be executed several time steps after the answer was sent and received. Thus, the state of the system at time t, as seen by the learner agent, can depend on tasks that were delegated at time t - k. This supports the idea that learner agents should store past delegated actions whose effects could be delayed. However, adding such a memory increases the learner agent's state space exponentially, which prevents the reinforcement learning algorithm from converging. Furthermore, the learner agent has to learn to wait for a delegated action to be performed before executing another one.

In the following section, we present our solution to the first two issues: using simple MAS protocols in the context of introspective agents allows us to discover possible interactions. In section III, we present an iterative approach to building a memory that solves the third problem (related to agent asynchrony). We discuss our evaluation results in section IV and related work on multi-agent interaction learning in section V. Finally, we conclude in section VI.

II. BUILDING MESSAGES

In our model, we use the VDL multi-agent platform [7], which provides an agent communication language (ACL) based on the FIPA ACL model, extended with speech acts for introspection.
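As a rough illustration of the message structures involved, the sketch below builds request and query acts in the generic FIPA ACL style. The field names and helper functions are illustrative only; the concrete VDL syntax and its introspection speech acts may differ.

```python
from dataclasses import dataclass

# Illustrative sketch of a FIPA-ACL-style message; the concrete VDL
# message syntax may differ from these field names.
@dataclass(frozen=True)
class ACLMessage:
    performative: str   # e.g. "request", "query-ref", "inform"
    sender: str
    receiver: str
    content: str

def make_request(sender: str, receiver: str, action: str) -> ACLMessage:
    # A request act asks the receiver to perform one of its capacities.
    return ACLMessage("request", sender, receiver, action)

def make_query(sender: str, receiver: str, question: str) -> ACLMessage:
    # A query act asks about the receiver's internal state or capacities.
    return ACLMessage("query-ref", sender, receiver, question)
```

In this view, the learning problem amounts to choosing, at each step, which of these messages (if any) to build and send.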
The VDL model assumes that agents have access at runtime to the list of their capacities (with their preconditions and effects) and to their internal state. A VDL agent can thus answer questions such as "what can you do now?". This capability is used to extract the capacities of the learner agent's peers and, thus, to determine the content of future request and query messages.

Our model uses two interaction protocols. The what-query protocol allows agents to discover what they can ask other agents about their internal state and, thus, to build eligible query messages. Discovered query messages lead to inform answers. The content of these inform messages is used to store beliefs in the learner agent's state.
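The exchange above can be sketched as follows. The class and method names (Capacity, on_what_query, discover) are hypothetical, chosen for illustration; they are not the VDL API. The sketch only captures the idea that a peer answers "what can you do now?" with its currently applicable capacities, and that the learner stores the answer as beliefs from which it derives eligible request messages.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Capacity:
    name: str
    preconditions: tuple  # facts that must hold in the peer's state
    effects: tuple        # facts made true by performing the capacity

class IntrospectiveAgent:
    """A peer that can introspect its own capacities (illustrative)."""
    def __init__(self, name, capacities):
        self.name = name
        self.capacities = capacities

    def on_what_query(self, state):
        # "What can you do now?": answer with the capacities whose
        # preconditions hold in the agent's current internal state.
        return [c for c in self.capacities
                if all(p in state for p in c.preconditions)]

class LearnerAgent:
    def __init__(self):
        self.beliefs = {}

    def discover(self, peer, peer_state):
        # Send the what-query and store the inform answer as beliefs.
        capacities = peer.on_what_query(peer_state)
        self.beliefs[peer.name] = capacities
        # One eligible request message per applicable capacity.
        return [("request", peer.name, c.name) for c in capacities]
```

For example, if a peer's only applicable capacity in its current state is `open_door`, the learner discovers a single eligible request message addressed to that peer.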