NMRDPP: Decision-Theoretic Planning with Control Knowledge

Charles Gretton, David Price, and Sylvie Thiébaux
Computer Sciences Laboratory
The Australian National University
Canberra, ACT, Australia
{charlesg,davidp,thiebaux}@csl.anu.edu.au

Abstract

We discuss NMRDPP, a system for solving decision processes with non-Markovian rewards. More specifically, the target decision processes exhibit Markovian dynamics, and rewarding behaviours are modelled as state trajectories specified in a linear temporal logic. In addition to implementing structured, tabular, and online MDP solution algorithms, NMRDPP can exploit domain-specific control knowledge: state trajectories that violate the user's knowledge or intuition about useful dynamics can be pruned from consideration by the MDP solution algorithm. Thus, in addition to facilitating the concise specification of complex reward structures, NMRDPP can be used to greatly speed up policy computation for propositional MDPs. To our knowledge, NMRDPP is the only implementation of solution algorithms designed to solve decision processes with non-Markovian rewards.

Introduction

NMRDPP (Gretton et al. 2003), the non-Markovian Reward Decision Process Planner, is a general-purpose planner for propositional decision processes with non-Markovian rewards (for our purposes a reward can be negative, so we do not distinguish between reward and cost), and hence also for Markovian ones. Target decision processes are usually stochastic and exhibit Markovian dynamics. The reward is modelled as a set of state trajectories, called behaviours, specified in a linear temporal logic. NMRDPP was originally developed to carry out an experimental evaluation of approaches for solving decision processes with non-Markovian rewards. Implemented in C++, NMRDPP supports a range of experimental algorithms and frameworks for solving NMRDPs. It is suited to participation in IPPC'04 as it facilitates planning in completely observable stochastic domains. NMRDPP is the first of its kind: previously, no approach to solving NMRDPs had been fully implemented, and no work had presented experimental results.

There have been two proposals for languages suitable for expressing rewarding behaviours: PLTL (Bacchus et al. 1996), a linear temporal logic of the past, and $FLTL (Thiébaux et al. 2002), a linear temporal logic of the future with reward. In either case, NMRDPP translates the NMRDP into a corresponding equivalent MDP (XMDP) which incorporates temporal variables capturing sufficient history to make the reward of the expanded process Markovian (i.e., there is a mapping from XMDP states to the reals). The available translation procedures are unique to each logic and not particularly straightforward (Bacchus et al. 1996; Bacchus et al. 1997; Thiébaux et al. 2002). NMRDP solution algorithms differ in their representations of the domain dynamics and the XMDP, and in the class of MDP solution methods, structured or non-structured, to which they are tied. NMRDPP can solve target decision problems online (during translation) using LAO* heuristic search techniques (Hansen and Zilberstein 2001). Alternatively, the complete XMDP can be generated and passed to classical structured or tabular policy computation algorithms, such as SPUDD (Boutilier et al. 1995; Hoey et al. 1999) or policy/value iteration (Howard 1960) respectively.
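To give the flavour of these behaviour languages, consider $FLTL, in which the propositional constant $ is read "the reward is received now". The two formulas below are illustrative sketches in the spirit of Thiébaux et al. (2002), with p standing for an arbitrary state proposition:

    □(p → $)          the reward is received in every state satisfying p
    ¬p U (p ∧ $)      the reward is received only in the first state satisfying p

In the first behaviour, every p-state along a trajectory is rewarded; in the second, the until operator ensures that only the first occurrence of p triggers the reward, after which the formula places no further reward obligations on the trajectory.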
Using the same mechanisms devised for non-Markovian rewards, state trajectories which violate the user's knowledge or intuition about useful dynamics can be pruned from consideration by the MDP solution algorithm. The specification of a set of such state sequences is called control knowledge, and it has been used to great effect by the deterministic planning community (Bacchus and Kabanza 2000). Thus, although there is no advantage to be gained from the concise specification of complex non-Markovian rewards during the competition, NMRDPP can exploit control knowledge to greatly speed up policy computation for propositional MDPs. By pruning states which violate the specified behaviours, we can mitigate the effect of Bellman's so-called curse of dimensionality.

In the remainder of this document, we shall present an overview of MDPs and NMRDPs and discuss their differences. We shall briefly discuss the logics that have been adopted to model reward and control knowledge, focusing in particular on $FLTL, and provide some examples of using $FLTL to specify control knowledge for a stochastic blocks-world domain. We shall conclude by summarising how we intend to compete with NMRDPP in IPPC'04.

MDPs and NMRDPs

Problem domains which participants shall consider during the main and learning tracks of IPPC'04, although specified in PPDDL1.0 (Younes and Littman 2004), can be modelled using the MDP formalism. Indeed, decision-theoretic