Partial Local FriendQ Multiagent Learning: Application to Team Automobile Coordination Problem¹

Julien Laumonier and Brahim Chaib-draa²

1 Introduction

In real-world cooperative multiagent problems, each agent often has only a partial view of the environment. If communication has a cost, the multiagent system designer has to find a compromise between increasing observability and the total cost of the multiagent system. To choose such a compromise, we propose to take into account the degree of observability, defined as the agent's vision distance, in a cooperative multiagent system by measuring the performance of the associated learned policy. Obviously, decreasing observability reduces the number of states accessible to the agents and therefore decreases the performance of the policy. We restrict our application to team games, a subclass of coordination problems in which all agents have the same utility function. We consider problems where the agents' designer does not know the model of the world. We can therefore use learning algorithms that have been proven to converge to a Pareto-optimal equilibrium, such as Friend Q-learning [4]. One can take an optimal algorithm to find the policy for the fully observable problem. We make the following assumptions: (1) mutually exclusive observations: each agent sees a partial view of the real state, but all agents together see the real state; (2) communication between agents is possible but is not considered an explicit part of the decision making; (3) there are only negative interactions between agents. One problem that meets these assumptions is the lane-choosing decision problem in Intelligent Transportation Systems, which aim to reduce congestion, pollution and stress, and to increase traffic safety.

2 Formal Model and Algorithms

Reinforcement learning allows an agent to learn by interacting with its environment. For a single-agent system, the basic formal model for reinforcement learning is the Markov decision process.
Using this model, the Q-learning algorithm computes the optimal values of the expected reward for the agent in a state s if action a is executed. Game theory, on the other hand, formally studies the interactions of rational agents. In a one-stage game, each agent has to choose an action to maximize its own utility, which depends on the others' actions. In game theory, the main solution concept is the Nash equilibrium, in which each agent's action is a best response to those of all the others. A solution is Pareto optimal if there does not exist any other solution in which one agent can improve its reward without decreasing the reward of another. The model that combines reinforcement learning and game theory is the Markov game. This model contains a set of agents Ag, a finite set of states S, a finite set of actions A, a transition function P, and an immediate reward function R. Among the algorithms that compute a policy for team Markov games, the Friend Q-learning algorithm, introduced by Littman [4], builds a policy that is a Pareto-optimal Nash equilibrium in team games. More specifically, this algorithm, based on Q-learning, uses the following update rule for the Q-values:

Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_{a′ ∈ A} Q(s′, a′)]

with a the joint action of all agents.

¹ This research is funded by the AUTO21 Network of Centers of Excellence, an automotive research and development program focusing on issues relating to the automobile in the 21st century. AUTO21 is a member of the Networks of Centers of Excellence of Canada program.
² DAMAS Laboratory, Department of Computer Science and Software Engineering, Laval University, Canada. {jlaumoni;chaib}@damas.ift.ulaval.ca

3 Problem Description

The vehicle coordination problem presented here is adapted from Moriarty and Langley [5]. More precisely, three vehicles have to coordinate to maintain velocity and to avoid collisions.
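As a minimal sketch of the Friend Q-learning update rule of Section 2, a tabular implementation might look as follows (the class, method and parameter names are ours, and α, γ are illustrative values; the source only specifies the update equation):

```python
from collections import defaultdict
from itertools import product

class FriendQ:
    """Minimal tabular Friend Q-learning for a team game.

    Q-values are indexed by (state, joint_action); because all agents
    share the same reward, the value of a state is the maximum over
    joint actions, as in the Friend-Q update of Littman [4].
    """
    def __init__(self, n_agents, actions, alpha=0.1, gamma=0.9):
        self.joint_actions = list(product(actions, repeat=n_agents))
        self.alpha, self.gamma = alpha, gamma
        self.Q = defaultdict(float)          # (state, joint_action) -> value

    def update(self, s, joint_a, r, s_next):
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]
        v_next = max(self.Q[(s_next, a)] for a in self.joint_actions)
        self.Q[(s, joint_a)] = ((1 - self.alpha) * self.Q[(s, joint_a)]
                                + self.alpha * (r + self.gamma * v_next))

    def greedy_joint_action(self, s):
        # Best joint action under the current Q-values
        return max(self.joint_actions, key=lambda a: self.Q[(s, a)])
```

Note that the maximization ranges over joint actions of all agents, which is what the partial joint action variant of Section 4 later reduces.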
Each vehicle is represented by a position and a velocity, and can change lane to the left, change lane to the right, or stay in the same lane. The objective for a learning algorithm is to find the best policy for each agent that maximizes the common reward, which is the average velocity at each turn, while avoiding collisions. The dynamics, states and actions are discretized in the simplest way. For this example, we simulate the road as a ring, meaning that a vehicle is placed back on the left side when it exits through the right side. Collisions occur when two agents occupy the same cell. At each step, a vehicle can choose among three actions: stay in the same lane, change to the right lane, or change to the left lane. We assume, in this problem, that each agent is able to see only its own local state, and the others' states through communication.

4 Partial Observability

To measure the effect of partial observability on performance, we define a partial state centered on one agent by introducing a distance of observability d. The initial problem becomes a d-partial problem. The distance d can be viewed as an influence area for the agent: increasing this distance increases the degree of observability. We define d_total as the maximal possible distance of observability for a given problem. In the d-partial problem, the new state is defined as the observation of the center agent within range d. More precisely, an agent j is in the partial state of a central agent i if its distance from i is at most d. The function f_d^i(s) uses the parameter d to compute the new local state. The partial local view can reduce the set of states and/or the set of joint actions. The size of the state set shrinks from O((X × Y × |V|)^N), with X the number of lanes, Y the length of the road, V the set of possible velocities and N the number of agents, to O(((2d + 1)² × |V|)^N). The number of states is thus divided by roughly (Y/(2d + 1))^N.
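The observation function f_d^i(s) can be sketched as follows. This is our reading of the (2d + 1)² term in the state count: agent j is visible to agent i when both its lane offset and its along-road offset are at most d, with the along-road offset computed on the ring; the function name and state encoding are illustrative.

```python
def partial_state(state, i, d, road_length):
    """f_d^i(s): the d-partial local state centered on agent i.

    `state` is a list of (lane, y, velocity) tuples, one per agent.
    Agent j is kept iff it lies in the (2d+1) x (2d+1) window around
    agent i.  Offsets are expressed relative to agent i so that
    translated situations map to the same local state.
    """
    lane_i, y_i, _ = state[i]
    local = []
    for lane_j, y_j, v_j in state:
        dy = (y_j - y_i) % road_length
        if dy > road_length // 2:
            dy -= road_length          # shortest signed distance on the ring
        if abs(dy) <= d and abs(lane_j - lane_i) <= d:
            local.append((lane_j - lane_i, dy, v_j))
    return tuple(sorted(local))        # canonical order for state aggregation
```

Because positions are relative to the center agent, the road length Y no longer appears in the local state, which is where the (Y/(2d + 1))^N reduction comes from.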
The Partial Joint Action (PJA) algorithm takes into account only the actions of the agents that are in the partial local view, as specified by d. This dramatically reduces the size of the joint action set.

ECAI 2006, G. Brewka et al. (Eds.), IOS Press, 2006. © 2006 The authors. All rights reserved. p. 729
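Under our reading, the PJA restriction can be sketched as enumerating joint actions only over the agents inside the local view (the helper name and representation are ours, not from the paper):

```python
from itertools import product

def partial_joint_actions(actions, visible_agents):
    """Joint actions restricted to the agents in the partial local view.

    `visible_agents` holds the indices of the agents within distance d
    of the center agent (center agent included).  The joint-action set
    shrinks from |A|^N to |A|^k, with k = len(visible_agents).
    """
    return [dict(zip(visible_agents, acts))
            for acts in product(actions, repeat=len(visible_agents))]
```

For three lane actions and two visible agents out of three, this yields 3² = 9 partial joint actions instead of 3³ = 27 full ones.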