Context MDPs

Christos Dimitrakakis
FIAS, University of Frankfurt, Germany
christos.dimitrakakis@gmail.com

September 9, 2010

Abstract

This paper presents a simple method for exact online inference and approximate decision making, applicable to large or partially observable Markov decision processes. The approach is based on a closed-form Bayesian inference procedure for a class of context models which contains variable order Markov decision processes. The models can be used for prediction, and thus for decision theoretic planning. The other novel step of this paper is to use the belief (context distribution) at any given time as a compact representation of system state, in a manner similar to predictive state representations. Since the belief update is linear time in the worst case, this allows for computationally efficient value iteration and reactive learning algorithms such as Q-learning for this class of models.

1 Introduction

We consider estimation of a class of context models that can approximate either large or partially observable Markov decision processes (MDPs). This is closely related to the context tree weighting algorithm (CTW) for discrete sequence prediction (Willems et al., 1995). We present a constructive definition of a context process, extending the one proposed in (Dimitrakakis, 2010) to the estimation of variable order MDPs. With a suitable choice of context structure, the construction is applicable to large or continuous MDPs as well. We introduce two simple algorithms, weighted context value iteration and weighted context Q-learning, for decision making in unknown environments with continuous or partially observable state. Finally, we provide preliminary experimental results on the predictive, state representation and decision making capabilities of the methods.
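The abstract's idea of using the belief over contexts as a compact state can be illustrated with a hedged sketch: this is not the paper's weighted context Q-learning algorithm, only ordinary tabular Q-learning where the "state" indexing the Q-table is a discretised belief vector. The helper `discretise`, the learning rate, the discount factor, and the action set are all illustrative assumptions.

```python
from collections import defaultdict

# Illustrative sketch (not the paper's algorithm): tabular Q-learning
# where the state is a discretised belief (context distribution),
# following the abstract's suggestion of using the belief as a compact
# representation of system state. ALPHA and GAMMA are arbitrary choices.
ALPHA, GAMMA = 0.1, 0.95
Q = defaultdict(float)  # maps (state, action) -> estimated value

def discretise(belief, bins=10):
    """Hypothetical helper: round each belief weight so the (otherwise
    continuous) belief vector can index a finite Q-table."""
    return tuple(round(p * bins) / bins for p in belief)

def q_update(belief, action, reward, next_belief, actions):
    """Standard Q-learning update applied to belief-states."""
    s, s_next = discretise(belief), discretise(next_belief)
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s, action)] += ALPHA * (reward + GAMMA * best_next - Q[(s, action)])
```

Because the belief update is linear time in the worst case, each such learning step remains cheap; the discretisation above is only one crude way to make the belief hashable.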
We consider discrete-time decision problems in unknown environments, with a known set of actions A chosen by the decision maker, and a set of observations Z drawn from some unknown process μ, to be made precise later. At each time step t ∈ N, the decision maker observes z_t ∈ Z, selects an action a_t ∈ A and receives a reward r_t ∈ R.
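The interaction protocol just described can be sketched as a simple loop. The environment class, its method names, and the random placeholder policy below are all illustrative assumptions, standing in for the unknown process μ and for the decision rules developed later in the paper.

```python
import random

class RandomEnvironment:
    """Hypothetical stand-in for the unknown process mu: emits random
    observations from Z and rewards in [0, 1], ignoring the action."""
    def __init__(self, observations):
        self.observations = observations  # the observation set Z

    def observe(self):
        return random.choice(self.observations)  # z_t in Z

    def act(self, action):
        return random.random()  # r_t in R

def interact(env, policy, horizon):
    """At each step t: observe z_t, select a_t, receive r_t."""
    history = []
    for t in range(horizon):
        z = env.observe()
        a = policy(z)      # placeholder for the decision maker's choice
        r = env.act(a)
        history.append((z, a, r))
    return history

# Example run with a uniformly random policy over two actions.
history = interact(RandomEnvironment([0, 1]),
                   policy=lambda z: random.choice([0, 1]),
                   horizon=5)
```

The `history` of (z_t, a_t, r_t) triples is exactly the data from which the context models of the following sections are estimated.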