Robust Online Optimization of Reward-uncertain MDPs

Kevin Regan and Craig Boutilier
Department of Computer Science
University of Toronto
{kmregan, cebly}@cs.toronto.edu

Abstract

Imprecise-reward Markov decision processes (IRMDPs) are MDPs in which the reward function is only partially specified (e.g., by some elicitation process). Recent work using minimax regret to solve IRMDPs has shown, despite their theoretical intractability, how the set of policies that are nondominated w.r.t. reward uncertainty can be exploited to accelerate regret computation. However, the number of nondominated policies is generally so large as to undermine this leverage. In this paper, we show how the quality of the approximation can be improved online by pruning/adding nondominated policies during reward elicitation, while maintaining computational tractability. Drawing insights from the POMDP literature, we also develop a new anytime algorithm for constructing the set of nondominated policies with provable (anytime) error bounds. These bounds can be exploited to great effect in our online approximation scheme.

1 Introduction

The use of Markov decision processes (MDPs) to model decision problems under uncertainty requires the specification of a large number of model parameters to capture both system dynamics and rewards. This specification remains a key challenge: while dynamics can be learned from data, residual uncertainty in estimated parameters often remains; and reward specification typically requires sophisticated human judgement to assess relevant tradeoffs. For this reason, considerable attention has been paid to finding robust solutions to MDPs whose parameters are imprecisely specified (e.g., computing robust policies, in the maximin sense, given transition probability uncertainty [2; 9; 12]). Recently, techniques for computing robust solutions for imprecise reward MDPs (IRMDPs) have been proposed [7; 15; 19].
The specification of rewards can be especially problematic, since reward functions cannot generally be learned from experience (except for the most simple objectives involving observable metrics). Reward assessment requires the translation of general user preferences, and tradeoffs with respect to the relative desirability of states and actions, into precise quantities—an extremely difficult task, as is well-documented in the decision theory literature [8]. Furthermore, this time-consuming process may need to be repeated for different users (with different preferences).

Fortunately, a fully specified reward function is often not needed to make optimal (or near-optimal) decisions [15]. IRMDPs are defined as MDPs in which the reward function lies in some set R (e.g., reflecting imprecise bounds on reward parameters). In this paper, we address the problem of fast, online computation of robust solutions for IRMDPs. We use minimax regret as our robustness criterion [15; 16; 19]. While solving IRMDPs using this measure is NP-hard [19], several techniques have been developed that allow the solution of small IRMDPs. Of particular note are methods that exploit the set Γ of nondominated policies, i.e., those policies that are optimal for some element of R [16; 19]. Unfortunately, these methods scale directly with the number of nondominated policies; and Γ is often too large to admit good computational performance. A subset Γ̃ of the nondominated set can be used as an approximation, and, if specific bounds on the approximation quality of that set are known, error bounds on the minimax solution can be derived [16]; but methods for producing a suitable approximate set with the requisite bounds are lacking.

We develop an approach for approximating minimax regret during elicitation by adjusting the subset Γ̃ online. It allows us to make explicit tradeoffs between the quality of the approximation and the efficiency of minimax regret computation.
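For concreteness, the minimax regret criterion can be sketched as follows (a standard formulation; the notation V^π_r, the value of policy π under reward function r, is assumed here, with formal definitions deferred to the background section):

```latex
\mathrm{MR}(\pi, \mathcal{R}) \;=\; \max_{r \in \mathcal{R}} \; \max_{\pi'} \; \bigl( V^{\pi'}_{r} - V^{\pi}_{r} \bigr),
\qquad
\mathrm{MMR}(\mathcal{R}) \;=\; \min_{\pi} \; \mathrm{MR}(\pi, \mathcal{R}).
```

Intuitively, MR(π, R) is the worst-case loss of policy π relative to the optimal policy for any reward in R; a minimax-optimal policy minimizes this worst-case loss. The inner maximization over π′ ranges only over nondominated policies, which is what makes the set Γ (or an approximation Γ̃) computationally valuable.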
In online elicitation of rewards, the feasible reward set R shrinks as users respond to queries about their preferences. This means that some nondominated policies in Γ̃ become dominated and can be ignored, thus improving online computational efficiency. This, in turn, permits further nondominated policies to be added to Γ̃, allowing for improvement in decision quality. To support online elicitation and computation, we also develop a new algorithm for constructing the set Γ̃ in an anytime fashion that provides an upper bound on minimax regret. This algorithm is based on insights from Cheng's [6] linear support method for POMDPs.

We first review relevant background on MDPs, IRMDPs and minimax regret. We then discuss how nondominated policies can be exploited online during reward elicitation, and develop the nondominated/region vertex (NRV) algorithm for