Limiting Games of Multi-agent Multi-state Problems

Peter Vrancx, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, pvrancx@vub.ac.be
Katja Verbeeck, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, kaverbee@vub.ac.be
Ann Nowe, Computational Modeling Lab, Vrije Universiteit Brussel, Brussels, Belgium, ann.nowe@vub.ac.be

ABSTRACT

We propose to analyse the behaviour of learning agents in a multi-state environment by approximating the problem with a limiting single-state game. The limiting game views each joint agent policy as a single play between players using the agents' policies as their actions. The payoff given to each player is the expected reward for the corresponding agent under the resulting joint policy. In the settings we explore, agents are fully ignorant, i.e. they can only observe themselves; they do not know how many other agents are present in the environment, which actions these other agents took, which rewards they received for this, or which locations they occupy in the state space. We compare two reinforcement learning algorithms, learning automata and Q-learning, and show experimentally that in the spatial coordination problems under study the automata converge to a Nash equilibrium of the limiting game.

1. INTRODUCTION

Analysing the behaviour of multi-agent reinforcement learning (MARL) algorithms is usually limited to single-state environments, modelled as normal form games. In this paper, we are concerned with the behaviour of multi-agent learning in multi-state environments, or Markov games. One way to do this is by translating the multi-state problem to a limiting single-state game and explaining the behaviour of the multi-agent learning technique in terms of equilibrium points in this limiting game.
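The limiting-game translation described above can be sketched concretely: each player's "action" is a complete policy for the corresponding agent, the payoff is that agent's expected reward under the resulting joint policy, and a joint policy is a Nash equilibrium of the limiting game when no single agent can improve its own expected reward by unilaterally switching policies. The function names and the toy coordination game below are illustrative assumptions, not taken from the paper.

```python
import itertools

def limiting_game(policies_per_agent, expected_reward):
    """Build the payoff table of the limiting game: one entry per
    joint policy, holding a tuple with each agent's expected reward."""
    table = {}
    for joint in itertools.product(*policies_per_agent):
        table[joint] = expected_reward(joint)
    return table

def is_nash(table, joint, policies_per_agent):
    """A joint policy is a Nash equilibrium of the limiting game if no
    single agent gains by unilaterally deviating to another policy."""
    for i, alternatives in enumerate(policies_per_agent):
        for alt in alternatives:
            if alt == joint[i]:
                continue
            deviation = joint[:i] + (alt,) + joint[i + 1:]
            if table[deviation][i] > table[joint][i]:
                return False
    return True

# Toy two-agent coordination problem: each agent picks one of two
# (deterministic) policies; matching policies yield reward 1 for both.
policies = [("A", "B"), ("A", "B")]

def coord(joint):
    return (1.0, 1.0) if joint[0] == joint[1] else (0.0, 0.0)

table = limiting_game(policies, coord)
```

In this toy game, ("A", "A") and ("B", "B") are both Nash equilibria of the limiting game, while the mismatched joint policies are not.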
The limiting game of a corresponding multi-agent multi-state problem can be defined as follows: each joint agent policy is viewed as a single play between players using the agents' policies as their individual actions. The payoff given to each player is the expected reward for the corresponding agent under the resulting joint policy. Limiting games were recently used to study the behaviour of a network of learning automata learning in Markov games [5]. A learning automaton describes the internal state of an agent as a probability distribution according to which actions should be chosen [4]. These probabilities are adjusted by some reinforcement scheme according to the success or failure of the actions taken. This form of reinforcement learning, which has its roots in psychology, can be viewed as hill-climbing in probability space. It is important to note that these LA update schemes work strictly on the basis of the response of the environment, and not on the basis of any knowledge regarding other automata, i.e. neither their strategies nor their feedback. It was shown that in the set-up of [5] the network of independent learning automata is able to reach equilibrium strategies in Markov games, provided some common ergodicity assumptions are fulfilled. In this paper we elaborate further on this result.

(This research is funded by a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT Vlaanderen).)
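The probability-updating behaviour described above can be illustrated with the classical linear reward-inaction (L_R-I) scheme from the LA literature; the text does not fix a particular scheme at this point, so the choice of L_R-I, the class name, and the step-size parameter below are illustrative assumptions.

```python
import random

class LearningAutomaton:
    """Linear reward-inaction (L_R-I) automaton: keeps a probability
    vector over its actions and, after a rewarded action, moves
    probability mass towards that action; after a failure (reward 0)
    it leaves the vector unchanged. The update uses only the
    environment's response, never information about other automata."""

    def __init__(self, n_actions, step=0.1):
        self.p = [1.0 / n_actions] * n_actions
        self.step = step  # learning rate, assumed in (0, 1)

    def choose(self):
        # Sample an action according to the current probabilities.
        return random.choices(range(len(self.p)), weights=self.p)[0]

    def update(self, action, reward):
        # reward is assumed to be a success signal in [0, 1].
        for a in range(len(self.p)):
            if a == action:
                self.p[a] += self.step * reward * (1.0 - self.p[a])
            else:
                self.p[a] -= self.step * reward * self.p[a]
```

Because the increase on the chosen action exactly balances the decreases on the others, the vector remains a probability distribution after every update, which is what makes the scheme a hill-climber in probability space.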
The original model assumes that in each state of the Markov game, each agent is represented by a learning automaton. Furthermore, the agents are assumed to observe their current state of the Markov game. Here we relax this assumption: agents only have partial observability; they do not observe the location of other agents acting in the same environment, they only see their own location. The LA set-up proposed in this paper therefore places, for every agent, a learning automaton at each location of the environment instead of at every state of the Markov game. We empirically show that the convergence result still holds for the spatial coordination problems under study. We start our theoretical analysis in a simple environment with only 2 locations, but we also report results on larger grid-world simulations. For comparison, we also conducted our experiments with independent Q-learners. Surprisingly, we see that in small problems independent Q-learners, which only see their own locations and rewards, are also able to reach equilibrium points of the limiting game.

This paper is organized as follows: the next section describes the tools we use for analysing multi-agent RL algorithms. Section 3 presents the LA update mechanism in case agents cannot observe the current joint state. Section 4 reports the experimental results, which are further discussed in Section 5.

2. TOOLS

In this section we describe the necessary tools for our analysis.

2.1 Markov games

The underlying framework for modelling multi-agent reinforcement learning problems is given by the framework