Learning to cooperate in multi-agent systems by combining Q-learning and evolutionary strategy

Mary McGlohon and Sandip Sen
mary-mcglohon@utulsa.edu, sandip@utulsa.edu
Department of Computer Science
The University of Tulsa, Tulsa, Oklahoma

Abstract

Coordination games can represent interactions between multiple agents in many real-life situations. Thus single-stage coordination games provide a stylized, abstracted environment for testing algorithms that allow artificial agents to learn to cooperate in such settings. Individual reinforcement learners often fail to learn coordinated behavior. Using an evolutionary approach to strategy selection can produce optimal joint behavior but may require significant computational effort. Our goal in this paper is to improve convergence to optimal behavior with reduced computational effort by combining learning and evolutionary techniques. In particular, we show that letting agents learn in between generations of an evolutionary algorithm allows them to more consistently learn effective cooperative behavior, even in difficult, stochastic environments. Our combined mechanism is a novel improvisation involving selecting actual rather than inherited behaviors.

1 Introduction

Reinforcement learning and evolutionary computation are active topics in the area of autonomous agents, due both to their generic applicability and to the aesthetic appeal of their similarities to biological systems. Coordination in games, in turn, has applications in the behavioral and social sciences. In this paper, we use instances of machine learning and adaptation techniques, Q-learning and evolutionary strategy, to learn to solve cooperative games. Many games are modeled as matrices representing the different action choices available to the players. In coordination games, actions along the diagonal of the matrix have higher payoffs than the other actions.
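As a concrete illustration, such a game can be written as a single payoff matrix whose diagonal entries dominate. The specific numbers below are our own hypothetical example, not a matrix taken from the paper:

```python
import numpy as np

# A hypothetical 3x3 coordination game (illustrative numbers only).
# Entry [i, j] is the payoff when the row player chooses action i and
# the column player chooses action j; the diagonal entries (coordinated
# choices) dominate the off-diagonal ones.
PAYOFF = np.array([
    [10, 0, 0],
    [ 0, 7, 0],
    [ 0, 0, 5],
])

def payoff(row_action: int, col_action: int) -> int:
    """Payoff for one play of the stage game."""
    return int(PAYOFF[row_action, col_action])

# The optimal joint action is for both players to pick action 0.
best = max(
    ((a, b) for a in range(3) for b in range(3)),
    key=lambda ab: payoff(*ab),
)
print(best)  # (0, 0)
```

Note that any mismatched pair of actions yields payoff 0 here, which is what makes coordination, rather than individual optimization, the core difficulty.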
Our work uses a special kind of coordination game: one that gives equivalent payoffs to both players for any given action combination. When the game is repeated, it is beneficial for an agent to learn which of its own action choices will produce the highest payoff, given the opponent's policy.

Traditional, single-agent machine learning algorithms applied to cooperative systems are not guaranteed to converge to the optimal action combination. Without sufficient exploration, an agent can get "stuck in a rut" and choose nonoptimal actions because the payoff seems good enough. Also, if the domain is nondeterministic, an agent may get discouraged by infrequent but significantly low payoffs for an action, even when choosing that action is optimal [3, 5]. Therefore, an algorithm with a proper balance of exploration and exploitation is desirable.

In our work, we have designed a combined evolutionary and reinforcement learning approach to solve some particular coordination games. A population of agents plays a game for several iterations, and the most successful agents pass phenotypic information on to their offspring. The offspring then improve on the parents' policy through a reinforcement learning algorithm. This often causes later generations to converge to optimal behavior in the coordination game.

2 Related Work

2.1 Single-stage coordination games

Single-stage coordination games have been used as a model for studying coordination in multi-agent systems [3, 5]. A stage game consists of a set α of n agents, where each agent i ∈ α has a set A_i of individual actions. The agents simultaneously choose actions from their respective sets A_i each time they play the stage game. A set of payoff matrices, R_i, specifies the payoff to each player for each possible action combination. In coordination games, payoffs are highest along the diagonal; in the games in this work, all players also receive the same payoff for each action combination.
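The interleaved scheme described above can be sketched in code. This is a minimal illustration of the general idea only: the population size, learning parameters, the pairing-based evaluation, and the uniform opponent model used during the learning phase are all our own assumptions, not the paper's actual algorithm.

```python
import random

ACTIONS = (0, 1, 2)
# Hypothetical common-payoff matrix; diagonal (coordinated) entries dominate.
PAYOFF = [[10, 0, 0],
          [0, 7, 0],
          [0, 0, 5]]

def greedy(q):
    """Action with the highest Q-value."""
    return max(ACTIONS, key=lambda a: q[a])

def q_learn(q, opponent_dist, episodes=300, alpha=0.2, eps=0.1):
    """Single-stage Q-learning against a fixed (mixed) opponent policy."""
    for _ in range(episodes):
        a = random.choice(ACTIONS) if random.random() < eps else greedy(q)
        b = random.choices(ACTIONS, weights=opponent_dist)[0]
        q[a] += alpha * (PAYOFF[a][b] - q[a])  # single-stage update, no successor state
    return q

def evolve(pop_size=10, generations=5):
    """Evolutionary loop with Q-learning interleaved between generations."""
    assert pop_size % 2 == 0
    pop = [[random.random() for _ in ACTIONS] for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate: pair agents at random; each pair plays its greedy
        # joint action and both members receive the common payoff.
        random.shuffle(pop)
        scores = []
        for i in range(0, pop_size, 2):
            a, b = greedy(pop[i]), greedy(pop[i + 1])
            scores += [PAYOFF[a][b], PAYOFF[a][b]]
        # Select: the fittest half passes its Q-values on to two
        # offspring each (phenotypic inheritance).
        ranked = [q for _, q in sorted(zip(scores, pop),
                                       key=lambda t: -t[0])]
        pop = [list(q) for q in ranked[:pop_size // 2] for _ in (0, 1)]
        # Learn: offspring refine the inherited policy by Q-learning
        # before the next generation is evaluated.
        uniform = [1.0 / len(ACTIONS)] * len(ACTIONS)
        pop = [q_learn(q, uniform) for q in pop]
    return pop
```

Copying the parents' learned Q-values directly into the offspring is what makes the inheritance phenotypic: the offspring start from behavior the parents acquired during their lifetime, then continue learning from there.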
We use π_i to denote the strategy profile for agent i, where π_i(a) is the probability of agent i choosing action a ∈ A_i in repeated play of stage games. Π denotes the product of the individual strategies. If each π_i is deterministic, i.e., π_i(a) = 1 for some a ∈ A_i, then Π is a joint action [3, 5]. Multiagent learning schemes attempt to converge to coordinated actions. The single-stage coordination games used in this paper