Special Issue Paper: Agent Technology and Its Applications 2021

Reward Design for Multi-Agent Reinforcement Learning with a Penalty Based on the Payment Mechanism

Natsuki Matsunami (Department of Computer Science, Nagoya Institute of Technology) matsunami.natsuki@itolab.nitech.ac.jp
Shun Okuhara (Department of Social Informatics, Kyoto University) okuhara@i.kyoto-u.ac.jp
Takayuki Ito (Department of Social Informatics, Kyoto University) ito@i.kyoto-u.ac.jp

keywords: multi-agent reinforcement learning, mechanism design, Vickrey-Clarke-Groves mechanism

Summary

In this paper, we propose a novel method of reward design for multi-agent reinforcement learning (MARL). One of the main uses of MARL is building cooperative policies among self-interested agents. We take inspiration from the concept of mechanism design in game theory to modify how agents are rewarded in MARL algorithms. We define a payment that reflects an agent's negative contribution to the other agents' valuations, in the same manner as the Vickrey-Clarke-Groves (VCG) mechanism. We give each individual learning agent a reward signal consisting of two elements: a reward evaluated solely on the basis of individual behavior, which by itself leads to a greedy, selfish policy, and a negative reward, a penalty evaluated on the basis of the payment, which reflects the agent's negative contribution to social welfare. We call this scheme reward design for MARL based on the payment mechanism (RDPM). We experimented with RDPM in two different scenarios and show that RDPM can increase the social utility among agents, whereas other reward designs achieve far less, even on basic, simplistic problems. Finally, we analyze and discuss how RDPM affects the building of a cooperative policy.

1. Introduction

In artificial intelligence, an agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators [Russell 09].
In the real world, there are few cases in which a single agent exists in an environment independently; in most real-world problems, agents need to take each other into account. However, such consideration of other agents is difficult to implement in a program in advance because of its complexity [Bu 08]. In many complex domains, reinforcement learning is the only feasible way to train a program to perform at a high level [Russell 09]. When multi-agent reinforcement learning (MARL) is applied to a cooperative task, the team goal should be achieved by the agents voluntarily. In other words, agents should select actions that achieve the team goal in a decentralized manner; therefore, agents have to cooperate, and sometimes even sacrifice themselves for other agents, if doing so increases the overall social utility. However, a cooperative policy is difficult to develop among agents without a centralized system, even when each agent can observe the other agents' states and actions. In this situation, the reward is the key signal that leads learning agents toward the desired cooperative policy in MARL. We therefore focus on this problem and propose a new reward design for MARL.

In mechanism design (MD), there is a large body of literature on cooperation and consideration of other agents from an economics perspective [Groves 73]. In MD theory, we design rules under which agents who participate in a mechanism and try to behave so as to obtain higher individual utility nevertheless produce a good result when making a decision as a group. In this regard, a strong relationship arises between reward design as an incentive mechanism for a cooperative task and the credit assignment of each agent's contribution.

In this paper, we propose a new method of reward design for MARL based on MD. The Vickrey-Clarke-Groves (VCG) mechanism is a well-known payment mechanism [Vidal 06], and we borrow its idea for evaluating each agent's contribution to the whole.
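As a concrete sketch of how a VCG-style (Clarke pivot) payment evaluates an agent's contribution, consider a group choosing one outcome from a finite set. Each agent is charged the loss its presence imposes on the others. The agent names, outcomes, and valuation numbers below are purely hypothetical, chosen only to make one agent pivotal:

```python
# Minimal sketch of a VCG (Clarke pivot) payment over a finite outcome set.
# All names and numbers are hypothetical illustrations, not from the paper.

def vcg_payment(valuations, outcomes, i):
    """Payment charged to agent i: the value the others lose because i exists."""
    others = [j for j in valuations if j != i]

    def best(agents):
        # Outcome maximizing the total value of the given group of agents
        return max(outcomes, key=lambda x: sum(valuations[j][x] for j in agents))

    chosen = best(list(valuations))  # outcome chosen with agent i present
    alt = best(others)               # outcome the others would choose alone
    return (sum(valuations[j][alt] for j in others)
            - sum(valuations[j][chosen] for j in others))

valuations = {
    "a1": {"A": 12, "B": 0},
    "a2": {"A": 0,  "B": 6},
    "a3": {"A": 0,  "B": 5},
}
# a1 is pivotal: with a1 the group picks A (12 > 6 + 5), without a1 the
# others would pick B, so a1 pays the value they lose: (6 + 5) - 0 = 11.
print(vcg_payment(valuations, ["A", "B"], "a1"))  # 11
print(vcg_payment(valuations, ["A", "B"], "a2"))  # 0 (not pivotal)
```

Because each agent pays exactly the externality it imposes on the rest of the group, non-pivotal agents pay nothing; this is the contribution measure the proposed penalty builds on.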
The payment in VCG measures an agent's contribution as the difference between the sums of the values the other agents obtain when the target agent does and does not exist. In our proposition,