Adaptation for Changing Stochastic Environments through Online POMDP Policy Learning

Guy Shani, Ronen I. Brafman, and Solomon E. Shimony
Ben-Gurion University, Beer-Sheva, Israel
{shanigu,brafman,shimony}@cs.bgu.ac.il

Abstract. Computing optimal or approximate policies for partially observable Markov decision processes (POMDPs) is a difficult task. When, in addition, the characteristics of the environment change over time, the problem is further compounded. A policy that was computed offline may stop being useful after sufficient changes to the environment have occurred. We present an online algorithm for incrementally improving POMDP policies, strongly motivated by the Heuristic Search Value Iteration (HSVI) approach: it locally improves the current value function after every action execution. Our algorithm adapts naturally to slow changes in the environment, without the need to model the changes explicitly. In an initial empirical evaluation, our algorithm shows a marked improvement over other online POMDP algorithms.

1 Introduction

Consider an agent situated in a partially observable domain: it executes an action that may change the state of the world; this change is reflected, in turn, by the agent's sensors; the action may have an associated cost, and the new state may have an associated reward or penalty. Thus, the agent's interaction with its environment is characterized by a sequence of action-observation-reward steps. Our goal is to have the agent act optimally (in the sense of expected reward) given what it knows about the world. Our focus is thus on agents with imperfect and noisy sensors, in the well-known framework of partially observable Markov decision processes (see Section 2 for an overview of POMDPs). Finding an optimal policy for a POMDP is computationally intractable in the worst case; the problem is known to be PSPACE-hard.
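The action-observation loop above can be sketched concretely. After each action a and observation o, the agent maintains a belief state via the standard Bayesian update b'(s') ∝ O(s', a, o) Σ_s T(s, a, s') b(s). The following is a minimal illustrative sketch (not the paper's algorithm); the two-state model and its probabilities are hypothetical numbers chosen for the example.

```python
def belief_update(b, a, o, T, O):
    """Bayesian belief update after taking action a and observing o.

    b: dict mapping state -> probability (current belief)
    T: dict mapping (s, a) -> {s': transition probability}
    O: dict mapping (s', a) -> {o: observation probability}
    """
    new_b = {}
    for s2 in b:
        # Prediction step: probability of reaching s2 under action a.
        pred = sum(T[(s, a)][s2] * b[s] for s in b)
        # Correction step: weight by the likelihood of observing o in s2.
        new_b[s2] = O[(s2, a)][o] * pred
    norm = sum(new_b.values())
    return {s: p / norm for s, p in new_b.items()}

# Hypothetical two-state example: a "listen" action leaves the state
# unchanged and yields a noisy observation of where the agent is.
states = ["left", "right"]
T = {(s, "listen"): {s2: 1.0 if s2 == s else 0.0 for s2 in states}
     for s in states}
O = {("left", "listen"): {"hear-left": 0.85, "hear-right": 0.15},
     ("right", "listen"): {"hear-left": 0.15, "hear-right": 0.85}}

b0 = {"left": 0.5, "right": 0.5}
b1 = belief_update(b0, "listen", "hear-left", T, O)
# The belief shifts toward "left": b1 == {"left": 0.85, "right": 0.15}
```

Acting optimally then amounts to mapping each such belief state to an action, which is what a POMDP policy (and the value function improved by the algorithm presented here) provides.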
While there are numerous algorithms for solving POMDPs with various restrictions, the difficulty of the general problem has prompted the development of many approximation algorithms. When the (stochastic) behavior of the environment does not vary over time, we may apply one of these approximation schemes. Such techniques may take a long time to produce a sufficiently good policy, but as this effort is carried out offline, it does not affect online policy execution.

However, when the above assumption of a static environment does not hold, a policy that was once optimal may become far from optimal as changes in the environment parameters (changes in the reward function, the transition probabilities, or sensor accuracy) accumulate. The naive solution is a costly re-computation of the policy. There are two problems with the naive approach: a) until the decision to re-compute is made, the agent is acting according to a sub-optimal policy, and b) complete re-computation