Partial Local FriendQ Multiagent Learning: Application
to Team Automobile Coordination Problem
Julien Laumonier and Brahim Chaib-draa
1 Introduction
In real-world cooperative multiagent problems, each agent often has only a partial view of the environment. If communication has a cost, the multiagent system designer has to find a compromise between increasing the observability and the total cost of the multiagent system. To choose such a compromise, we propose to take into account the degree of observability, defined as the agent's vision distance, for a cooperative multiagent system by measuring the performance of the associated learned policy. Obviously, decreasing observability reduces the number of states accessible to the agents and therefore decreases the
performance of the policy. We restrict our application to team games, a subclass of coordination problems in which all agents have the same utility function. We consider problems where the agents' designer does not know the model of the world. We can therefore use learning algorithms which have been proven to converge to a Pareto-optimal equilibrium, such as Friend Q-learning [4]. Such an algorithm can be used to find an optimal policy for the fully observable problem. The following assumptions are made: (1) mutually exclusive observations: each agent sees a partial view of the real state, but all agents together see the real state; (2) communication between agents is possible but is not considered an explicit part of the decision making; (3) there are only negative interactions between agents. One problem which meets these assumptions is the lane-choosing decision problem in Intelligent Transportation Systems, which aim to reduce congestion, pollution, and stress, and to increase traffic safety.
2 Formal Model and Algorithms
Reinforcement learning allows an agent to learn by interacting with its environment. For a single-agent system, the basic formal model for reinforcement learning is the Markov decision process. Using this model, the Q-Learning algorithm computes the optimal value of the expected reward for the agent in a state s when action a is executed. Game theory, on the other hand, formally studies the interactions of rational agents. In a one-stage game, each agent has to choose an action to maximize its own utility, which depends on the other agents' actions. In game theory, the main solution concept is the Nash equilibrium, in which every agent plays a best response to the others. A solution is Pareto optimal if there does not exist any other solution such that one agent can improve its reward without decreasing the reward of another. The model which combines reinforcement learning and game theory is
1. This research is funded by the AUTO21 Network of Centers of Excellence, an automotive research and development program focusing on issues relating to the automobile in the 21st century. AUTO21 is a member of the Networks of Centers of Excellence of Canada program.
2. DAMAS Laboratory, Department of Computer Science and Software Engineering, Laval University, Canada {jlaumoni;chaib}@damas.ift.ulaval.ca
Markov games. This model contains a set of agents Ag, a finite set of states S, a finite set of actions A, a transition function P, and an immediate reward function R. Among the algorithms which compute a policy for team Markov games, the Friend Q-Learning algorithm, introduced by Littman [4], makes it possible to build a policy which is a Nash Pareto-optimal equilibrium in team games. More specifically, this algorithm,
based on Q-Learning, uses the following function for updating the Q-values: Q(s, a) = (1 − α)Q(s, a) + α[r + γ max_{a ∈ A} Q(s′, a)], with a the joint action for all agents.
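This update can be sketched in a few lines of Python. The table layout, parameter values, and function names below are illustrative assumptions, not the authors' implementation; the point is that the only difference from standard Q-Learning is that Q is indexed by the joint action and the max ranges over all joint actions, since all agents are assumed to cooperate toward the shared team reward.

```python
from collections import defaultdict

def friend_q_update(Q, s, joint_a, r, s_next, joint_actions,
                    alpha=0.1, gamma=0.9):
    """One Friend-Q update on the team Q-table.

    Q maps (state, joint_action) pairs to values; the bootstrap term
    maximizes over every joint action available in the next state."""
    best_next = max(Q[(s_next, a)] for a in joint_actions)
    Q[(s, joint_a)] = (1 - alpha) * Q[(s, joint_a)] \
        + alpha * (r + gamma * best_next)

# Toy usage: two agents, individual actions {0, 1}, so four joint actions.
Q = defaultdict(float)
joint_actions = [(i, j) for i in (0, 1) for j in (0, 1)]
friend_q_update(Q, 's0', (0, 1), r=1.0, s_next='s1',
                joint_actions=joint_actions)
```

Starting from an all-zero table, this single update sets Q(s0, (0, 1)) to α·r = 0.1.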
3 Problem Description
The vehicle coordination problem presented here is adapted from
Moriarty and Langley [5]. More precisely, three vehicles have to coordinate to maintain their velocity and to avoid collisions. Each vehicle is represented by a position and a velocity and can change lane to the left, change lane to the right, or stay in the same lane. The objective for a learning algorithm is to find the best policy for each agent in order to maximize the common reward, which is the average velocity at each turn, while avoiding collisions. The dynamics, the states, and the actions are sampled in the simplest way. In this example, we simulate the road as a ring, meaning that a vehicle reappears on the left side when it exits through the right side. Collisions occur when two vehicles are in the same cell. At each step, a vehicle can choose among three actions: stay in the same lane, change to the right lane, or change to the left lane. We assume, in this problem, that each agent is able to see only its own local state; the other agents' states are available through communication.
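A minimal sketch of this ring-road dynamics follows; the grid size, action encoding, lane orientation, and collision penalty are assumptions for illustration, not the paper's exact simulator.

```python
# Illustrative ring-road world: vehicles are (lane, position, velocity)
# tuples on a discrete grid that wraps around horizontally.
ROAD_LENGTH = 20   # Y: number of cells along the ring
NUM_LANES = 3      # X: number of lanes
ACTIONS = {"stay": 0, "right": +1, "left": -1}  # assumed lane offsets

def step(vehicles, actions):
    """Advance each vehicle one turn.

    Lanes are clamped to the grid, positions wrap around the ring
    (exit right -> reappear left). Returns the new states and the
    common reward: average velocity, or a penalty on collision."""
    new_states = []
    for (lane, pos, vel), act in zip(vehicles, actions):
        lane = min(max(lane + ACTIONS[act], 0), NUM_LANES - 1)
        pos = (pos + vel) % ROAD_LENGTH
        new_states.append((lane, pos, vel))
    # A collision means two vehicles occupy the same cell.
    cells = [(lane, pos) for lane, pos, _ in new_states]
    if len(cells) != len(set(cells)):
        reward = -1.0  # assumed collision penalty
    else:
        reward = sum(vel for _, _, vel in new_states) / len(new_states)
    return new_states, reward
```

For instance, `step([(0, 0, 1), (1, 5, 2)], ["stay", "left"])` moves the second vehicle into lane 0 and, since no cell is shared, returns the average velocity 1.5 as the team reward.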
4 Partial Observability
To measure the effect of partial observability on the performance, we define the partial state centered on one agent by introducing a distance of observability d. The initial problem becomes a d-partial problem. The distance d can be viewed as an influence area for the agent: increasing this distance increases the degree of observability. We define d_total as the maximal possible distance of observability for a given problem. In the d-partial problem, the new state is defined as the observation of the center agent for a range d. More precisely, an agent j is in the partial state of a central agent i if its distance from i is at most d. The function f^i_d(s) uses the parameter d to compute the new local state. The partial local view can reduce
the set of states and/or the set of joint actions. The size of the state set goes from O((X × Y × |V|)^N), with X the number of lanes, Y the length of the road, V the set of possible velocities, and N the number of agents, to O(((2d + 1)^2 × |V|)^N). The number of states is therefore divided by approximately (Y/(2d + 1))^N. The Partial Joint Action
(PJA) algorithm takes into account only the actions of agents that are
in the partial local view as specified by d. This reduces dramatically
ECAI 2006
G. Brewka et al. (Eds.)
IOS Press, 2006
© 2006 The authors. All rights reserved.