J. Appl. Prob. 42, 15–26 (2005)
Printed in Israel
© Applied Probability Trust 2005

MULTI-ACTOR MARKOV DECISION PROCESSES

HYUN-SOO AHN,∗ University of Michigan
RHONDA RIGHTER,∗∗ University of California, Berkeley

Abstract

We give a very general reformulation of multi-actor Markov decision processes and show that there is a tendency for the actors to take the same action whenever possible. This considerably reduces the complexity of the problem, either facilitating numerical computation of the optimal policy or providing a basis for a heuristic.

Keywords: Markov decision process; multiarmed bandit; flexible server

2000 Mathematics Subject Classification: Primary 90C40; Secondary 90B22

1. Introduction

There have been many nice results establishing the optimality of index rules for classes of Markov decision processes with single actors. These include the traditional multiarmed bandit [3] and scheduling in networks of queues with a single server [4], [6], [7]. When there are multiple actors (players or servers), the problems become much more complicated, and simple index rules are generally no longer optimal. We give a very general reformulation of multi-actor Markov decision processes and give conditions under which there will be a tendency for the actors to take the same action, whenever possible, and for priority to be given to faster actors. This considerably reduces the complexity of the problem, either facilitating numerical computation of the optimal policy or providing a basis for a heuristic.

Our framework is very general. Since a simple index rule is no longer optimal, we can relax many of the assumptions required to obtain such a rule in previous work for single actors. We permit general, exogenous, random effects on the system; actors with different speeds; arbitrary constraints on which actors can take which actions; and all of these may be state dependent.
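A concrete instance of the single-actor index rules referred to above is the classic cµ rule for scheduling job classes at a single server: serve the waiting class with the largest product of holding-cost rate c_i and service rate µ_i. The sketch below is our own illustration, not part of the paper; the function name and the numerical data are hypothetical.

```python
# Illustrative sketch of a single-actor index rule (the classic c-mu rule):
# a single server always works on the job class maximizing c_i * mu_i,
# i.e. holding-cost rate times service rate.

def c_mu_index(c, mu):
    """Return job-class indices in priority order under the c*mu rule."""
    return sorted(range(len(c)), key=lambda i: c[i] * mu[i], reverse=True)

# Hypothetical data: three job classes.
c = [4.0, 1.0, 2.0]    # holding-cost rates
mu = [0.5, 3.0, 2.0]   # service rates, so c*mu = [2.0, 3.0, 4.0]
print(c_mu_index(c, mu))  # → [2, 1, 0]: class 2 gets highest priority
```

With a single server this priority list fully determines the policy in every state, which is exactly the simplicity that is lost when several actors must be assigned simultaneously.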
Our model includes quite general queues with multiple servers, multiarmed bandits with multiple players, and data-flow models in which tokens (actors) can enable certain firings (state changes). We are also able to show that our structural results hold for stochastic optimality as long as such optimality is achievable. (By stochastic optimality we mean maximization of the net benefit in the stochastic sense, rather than just maximization of the mean net benefit.) We also give conditions under which the optimal policy can be implemented with distributed control; that is, each actor can choose its own action to maximize its own marginal return.

Many results in the literature follow from ours. Ahn et al. [1] considered a two-station queueing model with two flexible workers, Poisson arrivals, exponential service times, holding costs, and preemption permitted. Thus, there are two actors and two actions. They showed that,

Received 7 April 2004; revision received 22 July 2004.
∗ Postal address: Operations and Management Science, University of Michigan Business School, 701 Tappan Street, Ann Arbor, MI 48109-1234, USA. Email address: hsahn@umich.edu
∗∗ Postal address: Department of Industrial Engineering and Operations Research, University of California, Berkeley, CA 94720, USA. Email address: rrighter@ieor.berkeley.edu

https://doi.org/10.1239/jap/1110381367
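The numerical computation the abstract alludes to can be illustrated on a model in the spirit of the two-station, two-flexible-worker system of Ahn et al. [1]. The sketch below is our own construction, not the paper's (or [1]'s) actual formulation: a tandem queue where both workers may collaborate at either station, solved by uniformized value iteration over a truncated state space. All parameter values and variable names are hypothetical.

```python
# Minimal value-iteration sketch (our illustration, hypothetical parameters):
# two tandem stations, two identical flexible workers, Poisson arrivals at
# station 1, exponential services, linear holding costs, discounted cost.
# Action k = number of workers assigned to station 1 (the rest to station 2).

lam, mu1, mu2 = 1.0, 2.0, 2.0   # arrival rate and per-worker service rates
h1, h2 = 2.0, 1.0               # holding-cost rates at stations 1 and 2
beta = 0.1                      # continuous-time discount rate
N = 10                          # queue-length truncation level

Lam = lam + 2 * max(mu1, mu2)   # uniformization constant
alpha = Lam / (Lam + beta)      # effective per-period discount factor

states = [(n1, n2) for n1 in range(N + 1) for n2 in range(N + 1)]
V = {s: 0.0 for s in states}

def step(V, n1, n2, k):
    """Expected next-period value when k workers serve station 1."""
    r1 = k * mu1 if n1 > 0 else 0.0          # total service rate, station 1
    r2 = (2 - k) * mu2 if n2 > 0 else 0.0    # total service rate, station 2
    v = lam * V[(min(n1 + 1, N), n2)]                         # arrival
    v += r1 * V[(n1 - 1, min(n2 + 1, N))] if n1 > 0 else 0.0  # move 1 -> 2
    v += r2 * V[(n1, n2 - 1)] if n2 > 0 else 0.0              # departure
    v += (Lam - lam - r1 - r2) * V[(n1, n2)]                  # fictitious jump
    return v / Lam

for _ in range(500):  # value iteration to (near) convergence
    V = {(n1, n2): (h1 * n1 + h2 * n2) / (Lam + beta)
         + alpha * min(step(V, n1, n2, k) for k in (0, 1, 2))
         for (n1, n2) in states}

# Optimal worker assignment in each state.
policy = {s: min((0, 1, 2), key=lambda k: step(V, s[0], s[1], k))
          for s in states}
```

Inspecting `policy` state by state shows in how many states the minimizing action sends both workers to the same station (k = 0 or k = 2); that concentration of actors on one action is the tendency that the paper's reformulation makes precise.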