J. Appl. Prob. 42, 15–26 (2005)
Printed in Israel
© Applied Probability Trust 2005
MULTI-ACTOR MARKOV DECISION PROCESSES
HYUN-SOO AHN,∗ University of Michigan
RHONDA RIGHTER,∗∗ University of California, Berkeley
Abstract
We give a very general reformulation of multi-actor Markov decision processes and
show that there is a tendency for the actors to take the same action whenever possible.
This considerably reduces the complexity of the problem, either facilitating numerical
computation of the optimal policy or providing a basis for a heuristic.
Keywords: Markov decision process; multiarmed bandit; flexible server
2000 Mathematics Subject Classification: Primary 90C40; Secondary 90B22
1. Introduction
There have been many nice results establishing the optimality of index rules for classes of
Markov decision processes with single actors. These include the traditional multiarmed bandit
[3], and scheduling in networks of queues with a single server [4], [6], [7]. When there are
multiple actors (players or servers), the problems become much more complicated, and simple
index rules are generally no longer optimal. We give a very general reformulation of multi-
actor Markov decision processes and give conditions under which there will be a tendency for
the actors to take the same action, whenever possible, and for priority to be given to faster
actors. This considerably reduces the complexity of the problem, either facilitating numerical
computation of the optimal policy or providing a basis for a heuristic.
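To make the claimed complexity reduction concrete, here is a minimal sketch (not taken from the paper; the action names and the number of actors are illustrative assumptions): if k actors each choose from an action set A, the joint action space has |A|^k elements, but restricting the actors to take the same action whenever possible leaves only |A| candidates to evaluate per state.

```python
# Hypothetical illustration: restricting k actors to a common action
# shrinks the joint action space from |A|**k to |A|.
from itertools import product

def joint_actions(actions, k):
    """All joint actions for k actors, each choosing from `actions`."""
    return list(product(actions, repeat=k))

def common_actions(actions, k):
    """Only those joint actions in which every actor takes the same action."""
    return [(a,) * k for a in actions]

# Assumed example: three actions, four actors.
actions = ["serve_queue_1", "serve_queue_2", "idle"]
k = 4

full = joint_actions(actions, k)
same = common_actions(actions, k)
print(len(full))  # 3**4 = 81 joint actions to evaluate per state
print(len(same))  # only 3 when the actors act in unison
```

In a value-iteration or policy-iteration scheme, this restriction shrinks the per-state maximization from exponential to linear in |A|, which is what makes numerical computation of the optimal policy tractable.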
Our framework is very general. Since a simple index rule is no longer optimal, we can relax
many of the assumptions required to obtain such a rule in previous work for single actors. We
permit general, exogenous, random effects on the system, actors with different speeds, arbitrary
constraints on which actors can take which actions, and all of these may be state dependent. Our
model includes quite general queues with multiple servers, multiarmed bandits with multiple
players, and data-flow models in which tokens (actors) can enable certain firings (state changes).
We are also able to show that our structural results hold for stochastic optimality as long as such
optimality is achievable. (By stochastic optimality we mean maximization of the net benefit
in the stochastic sense, rather than just maximization of the mean net benefit.) We also give
conditions under which the optimal policy can be implemented with distributed control. That
is, each actor can choose its own action to maximize its own marginal return.
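The distributed-control idea can be sketched as follows (a hypothetical illustration, not the paper's construction: the actor names, speeds, and reward values are invented for the example). Each actor independently selects the feasible action with the largest marginal return for itself in the current state.

```python
# Hypothetical sketch of distributed control: each actor greedily
# maximizes its own marginal return, with no central coordinator.
def distributed_policy(state, actors, feasible, marginal_return):
    """Each actor independently picks the feasible action with the
    largest marginal return for that actor in the current state."""
    return {
        actor: max(feasible(actor, state),
                   key=lambda a: marginal_return(actor, a, state))
        for actor in actors
    }

# Toy example: two actors whose returns scale with their speeds.  Since
# returns are speed-scaled copies of a common reward, both actors rank the
# actions identically and choose the same (best) one.
speeds = {"fast": 2.0, "slow": 1.0}
base_reward = {"task_A": 5.0, "task_B": 3.0}
choice = distributed_policy(
    state=None,
    actors=["fast", "slow"],
    feasible=lambda actor, s: ["task_A", "task_B"],
    marginal_return=lambda actor, a, s: speeds[actor] * base_reward[a],
)
print(choice)  # both actors choose "task_A"
```

Under the conditions the paper establishes, such myopic per-actor choices reproduce the centrally optimal policy, which is what makes distributed implementation possible.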
Many results in the literature follow from ours. Ahn et al. [1] considered a two-station
queueing model with two flexible workers, Poisson arrivals, exponential service times, holding
costs, and preemption permitted. Thus, there are two actors and two actions. They showed that,
Received 7 April 2004; revision received 22 July 2004.
∗ Postal address: Operations and Management Science, University of Michigan Business School, 701 Tappan Street, Ann Arbor, MI 48109-1234, USA. Email address: hsahn@umich.edu
∗∗ Postal address: Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, USA. Email address: rrighter@ieor.berkeley.edu
doi: 10.1239/jap/1110381367