Signal Instructed Coordination in Team Competition

Liheng Chen 1,2†, Hongyi Guo 1, Haifeng Zhang 3‡, Fei Fang 4♯, Yaoming Zhu 1, Ming Zhou 1, Weinan Zhang 1, Qing Wang 5*, Yong Yu 1

1 Shanghai Jiao Tong University  2 Tencent AI Lab  3 University College London  4 Carnegie Mellon University  5 Huya AI

† clhbob@sjtu.edu.cn  ‡ haifeng.zhang@ucl.ac.uk  ♯ feif@cs.cmu.edu

Abstract

Most existing models of multi-agent reinforcement learning (MARL) adopt the centralized training with decentralized execution framework. We demonstrate that the decentralized execution scheme restricts agents' capacity to find a better joint policy in team competition games, where each team of agents shares a common reward and cooperates to compete against other teams. To resolve this problem, we propose Signal Instructed Coordination (SIC), a novel coordination module that can be integrated with most existing models. SIC casts a common signal, sampled from a pre-defined distribution, to team members, and adopts an information-theoretic regularization to encourage agents to exploit the instruction of the centralized signal during learning. Our experiments show that SIC consistently improves team performance over well-recognized MARL models on matrix games and predator-prey games.

Introduction

Multi-agent systems (Lowe et al. 2017) are common in many real-world scenarios, e.g., complex games and social dilemmas. Recently, there has been growing interest in multi-agent reinforcement learning (MARL), where learning paradigms are proposed to apply reinforcement learning algorithms to multi-agent systems. A straightforward approach is the fully centralized method, which regards all agents as one and applies successful single-agent reinforcement learning algorithms. However, the fully centralized method suffers from the exponential growth of the joint action space with the number of agents.
In contrast, the fully decentralized method models each participant as an individual agent with its own policy and critic. This setting fails to resolve the non-stationarity of the environment (Lanctot et al. 2017; Matignon, Laurent, and Le Fort-Piat 2012), and has been shown empirically to perform poorly (Foerster et al. 2016; Li 2018). An alternative paradigm between them is centralized training with decentralized execution (Oliehoek, Spaan, and Vlassis 2008; Lowe et al. 2017), where each agent learns an individual policy and a centralized critic. During the training stage, the centralized critic conditions its value estimation on the joint observations and actions of all agents, while only the decentralized policies are used in the decision-making stage. This training paradigm bypasses the non-stationarity problem, and is adopted by a number of recent models (Lowe et al. 2017; Foerster et al. 2018b; Das et al. 2018; Iqbal and Sha 2018; Rashid et al. 2018; Son et al. 2019).

* The work was done while the author was at Tencent AI Lab.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

We argue that in a common variant of mixed cooperative-competitive environments, team competition games, where two teams of agents cooperate to compete against each other, the popular decentralized decision-making scheme restricts agents' ability to find a better joint policy. In team competition, all agents within a team share a common reward that is opposite to that of the other team, and both teams aim to reach Nash equilibria with higher expected returns. We demonstrate that under a decentralized scheme, a team of agents can only explore strategies in a joint policy space smaller than the meta-policy space of the fully cooperative approach, and consequently misses the chance to reach better Nash equilibria with higher returns.
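A hypothetical toy game (ours for illustration, not taken from the paper) makes this restriction concrete. Two teammates each pick an action in {0, 1}, then an opponent picks b in {0, 1}; the team earns +1 for matching each other but not b, 0 for matching each other and b, and -1 for miscoordinating. The best joint policy mixes uniformly over the correlated pairs (0, 0) and (1, 1), which no product of independent per-agent policies can represent:

```python
def team_reward(a0, a1, b):
    # +1: teammates match and surprise the opponent; 0: matched but
    # predicted; -1: teammates miscoordinate.
    if a0 != a1:
        return -1.0
    return 1.0 if a0 != b else 0.0

def value_vs_best_response(joint):
    # joint maps (a0, a1) -> probability; the opponent best-responds,
    # so the team's game value is its minimum expected reward over b.
    return min(
        sum(p * team_reward(a0, a1, b) for (a0, a1), p in joint.items())
        for b in (0, 1)
    )

def product_joint(x, y):
    # Joint distribution induced by independent decentralized policies:
    # agent 0 plays action 0 with prob x, agent 1 with prob y.
    marginals = ({0: x, 1: 1 - x}, {0: y, 1: 1 - y})
    return {(a0, a1): marginals[0][a0] * marginals[1][a1]
            for a0 in (0, 1) for a1 in (0, 1)}

# Grid-search the best product (independent) policy ...
grid = [i / 100 for i in range(101)]
best_product = max(value_vs_best_response(product_joint(x, y))
                   for x in grid for y in grid)

# ... versus the correlated policy both teammates can realize by
# copying one shared fair coin flip.
correlated = value_vs_best_response({(0, 0): 0.5, (1, 1): 0.5})

print(f"best independent (product) value: {best_product:.2f}")  # 0.00
print(f"signal-correlated value: {correlated:.2f}")             # 0.50
```

Against a best-responding opponent, no product of marginals beats the deterministic value of 0, while correlated randomization secures 0.5, i.e., the decentralized team forfeits an equilibrium with strictly higher return.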
Therefore, it is meaningful to find an algorithm that maintains the decentralized execution scheme for ease of training while enabling agents to explore a larger policy space for higher game values in competition.

In this paper, we propose Signal Instructed Coordination (SIC), a novel plug-in module for learning coordinated policies in the centralized training with decentralized execution paradigm. For each team, SIC samples a common signal from a pre-defined distribution and casts it to all team members to coordinate their decentralized policies. As all agents in a team receive the same signal, they are capable of inferring the policies of their teammates and making decisions accordingly. Theoretically, when the space of possible signals is sufficiently large, the agents in a team can achieve perfect coordination, i.e., behave like a fully centralized agent, as the signal can implicitly designate the action each agent should take. The signal can therefore extend the joint policy space to match that of a fully centralized approach. To encourage agents to follow the instructions of the signal, we introduce an information-theoretic regularization, which maximizes the mutual information between signal variables and joint policies.

Our SIC can be easily integrated with most existing mod-

arXiv:1909.04224v1 [cs.MA] 10 Sep 2019
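The signal mechanism can be sketched with hand-written tabular policies (a hypothetical illustration; the paper's agents are learned models). One common signal z per episode is drawn from a pre-defined distribution and appended to every teammate's own input, so decentralized policies can realize a correlated joint policy:

```python
import random

SIGNAL_SPACE = [0, 1]  # assumed binary signal space for this sketch

def sample_signal():
    # The team's centralized signal, cast to every team member.
    return random.choice(SIGNAL_SPACE)

def decentralized_policy(agent_obs, z):
    # Each agent acts only on its own observation plus the shared z.
    # Here every agent simply follows the signal, so the induced joint
    # policy mixes uniformly over the correlated pairs (0, 0) and
    # (1, 1) -- a distribution that no product of independent,
    # signal-free policies can represent.
    return z

random.seed(0)
counts = {}
for _ in range(10_000):
    z = sample_signal()  # one draw, shared by the whole team
    joint = tuple(decentralized_policy(obs, z) for obs in ("o0", "o1"))
    counts[joint] = counts.get(joint, 0) + 1

print(counts)  # only (0, 0) and (1, 1) occur, each roughly half the time
```

In this toy the policies follow z by construction, so the mutual information between the signal and the joint action is already maximal; for learned policies, an information-theoretic regularizer of the kind the paper describes is commonly optimized through the standard variational lower bound I(z; a) ≥ E[log q_φ(z | a)] + H(z) with an auxiliary posterior q_φ, though the paper's exact estimator is not shown here.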