Stationary Deterministic Policies for Constrained MDPs with Multiple Rewards, Costs, and Discount Factors

Dmitri Dolgov and Edmund Durfee
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109
{ddolgov, durfee}@umich.edu

Abstract

We consider the problem of policy optimization for a resource-limited agent with multiple time-dependent objectives, represented as an MDP with multiple discount factors in the objective function and constraints. We show that limiting search to stationary deterministic policies, coupled with a novel problem reduction to mixed integer programming, yields an algorithm for finding such policies that is computationally feasible, where no such algorithm has heretofore been identified. In the simpler case where the constrained MDP has a single discount factor, our technique provides a new way for finding an optimal deterministic policy, where previous methods could only find randomized policies. We analyze the properties of our approach and describe implementation results.

1 Introduction

Markov decision processes [Bellman, 1957] provide a simple and elegant framework for constructing optimal policies for agents in stochastic environments. The classical MDP formulations usually maximize a measure of the aggregate reward received by the agent. For instance, in widely-used discounted MDPs, the objective is to maximize the expected value of a sum of exponentially discounted scalar rewards received by the agent. Such MDPs have a number of very nice properties: they are subject to the principle of local optimality, according to which the optimal action for a state is independent of the choice of actions for other states, and the optimal policies for such MDPs are stationary, deterministic, and do not depend on the initial state of the system.
These properties translate into very efficient dynamic-programming algorithms for constructing optimal policies for such MDPs (e.g., [Puterman, 1994]), and policies that are easy to implement in standard agent architectures.

* This material is based upon work supported by Honeywell International, and by the DARPA/IPTO COORDINATORs program and the Air Force Research Laboratory under Contract No. FA8750–05–C–0030. The views and conclusions contained in this document are those of the authors, and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

However, there are numerous domains where the classical MDP model proves inadequate, because it can be very difficult to fold all the relevant feedback from the environment (i.e., rewards the agent receives and costs it incurs) into a single scalar reward function. In particular, the agent’s actions, in addition to producing rewards, might also incur costs that might be measured very differently from the rewards, making it hard or impossible to express both on the same scale. For example, a natural problem for a delivery agent is to maximize aggregate reward for making deliveries, subject to constraints on the total time spent en route. Problems naturally modeled as constrained MDPs also often arise in other domains: for example, in telecommunication applications (e.g., [Lazar, 1983]), where it is desirable to maximize throughput subject to delay constraints.

Another situation where the classical MDP model is not expressive enough is where an agent receives multiple reward streams and incurs multiple costs, each with a different discount factor.
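As a concrete illustration of the constrained setting just described: with a single discount factor, a constrained MDP is standardly solved as a linear program over state-action occupancy measures x(s, a), maximizing expected discounted reward subject to flow-conservation constraints and a bound on expected discounted cost. The following sketch (all transition, reward, and cost numbers are hypothetical, chosen only for illustration) sets up that LP with scipy:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP; all numbers are made up for illustration.
S, A = 2, 2
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 2.0], [0.5, 3.0]])      # rewards r(s, a)
cost = np.array([[0.0, 1.0], [0.0, 2.0]])   # costs   c(s, a)
C = 5.0                                     # bound on expected discounted cost
alpha = np.array([1.0, 0.0])                # initial state distribution

# Flow conservation over occupancy measures x(s, a):
#   sum_a x(s', a) - gamma * sum_{s, a} P(s' | s, a) x(s, a) = alpha(s')
A_eq = np.zeros((S, S * A))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = float(sp == s) - gamma * P[s, a, sp]

res = linprog(c=-r.ravel(),                          # maximize expected reward
              A_ub=cost.ravel()[None, :], b_ub=[C],  # expected-cost constraint
              A_eq=A_eq, b_eq=alpha, bounds=(0, None))
x = res.x.reshape(S, A)
policy = x / x.sum(axis=1, keepdims=True)   # pi(a | s) = x(s, a) / sum_a' x(s, a')
```

The policy recovered this way is randomized in general; forcing it to be deterministic requires integrality conditions on the occupancy measures, which is what the mixed-integer reduction described in the abstract addresses.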
For example, the delivery agent could face a rush-hour situation where the rewards for making deliveries decrease as a function of time (same delivery action produces lower reward if it is executed at a later time), while the traffic conditions improve with time (same delivery action can be executed faster at a later time). If the rewards decrease and traffic conditions improve on different time scales, the problem can be naturally modeled with two discount factors, allowing the agent to evaluate the tradeoffs between early and late delivery. Problems with multiple discount factors also frequently arise in other domains: for example, an agent can be involved in several financial ventures with different risk levels and time scales, where a model with multiple discount factors would allow the decision maker to quantitatively weigh the tradeoffs between shorter- and longer-term investments. Feinberg and Shwartz [1999] describe more examples and provide further justification for constrained models with several discount factors.

The price we have to pay for extending the classical model by introducing constraints and several discount factors is that stationary deterministic policies are no longer guaranteed to be optimal [Feinberg and Shwartz, 1994; 1995]. Searching for an optimal policy in the larger class of non-stationary randomized policies can dramatically increase problem complexity; in fact, the complexity of finding optimal policies for this broad class of constrained MDPs with multiple costs, re-