Exploration in POMDP belief space and its impact on value iteration approximation

Masoumeh T. Izadi, Doina Precup

Abstract. Decision making under uncertainty is among the most challenging tasks in artificial intelligence. Although solution methods for this class of problems are intractable in general, some promising approximation methods have been proposed recently. In particular, point-based planning algorithms for solving partially observable Markov decision processes (POMDPs) have demonstrated that a good approximation of the value function can be obtained by interpolating between the values of a selected set of points. The agent must make a choice as to how to sample these points. Ideally, we want to sample points that yield an accurate approximation in as little time as possible. In this paper, we relate this problem to the exploration-exploitation tradeoff in the space of POMDP reachable beliefs. Furthermore, we show that there exists an influential control parameter for this tradeoff. As a result, we provide a controllable, tighter bound for the point-based value iteration (PBVI) approximation [4] based on knowledge about the domain. We study two criteria designed to improve point-based value iteration algorithms when selecting candidate points. The first is based on reachability analysis from the given initial belief state. The second is based on the degree of stochasticity of the problem domain and the topological structure of the beliefs experienced by the agent. We present an empirical evaluation illustrating the effect of these criteria on the performance of point-based value iteration.

1 Introduction

Partially Observable Markov Decision Processes (POMDPs) provide a standard framework for studying decision making under uncertainty. In a POMDP, the state of the system in which the decisions take place is never fully observed. Only observations that depend probabilistically on the hidden state are available.
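To make this setting concrete, the agent summarizes its interaction history as a belief state, i.e. a probability distribution over the hidden states, updated by Bayes' rule after each action and observation. A minimal sketch of this update, where the two-state transition and observation matrices are made up purely for illustration:

```python
import numpy as np

# Illustrative two-state POMDP model (made-up numbers):
# T[a][s, s'] = P(s' | s, a);  O[a][s', o] = P(o | s', a)
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}

def belief_update(b, a, o):
    """Bayes filter: b'(s') is proportional to O[a][s', o] * sum_s T[a][s, s'] * b(s)."""
    unnormalized = O[a][:, o] * (b @ T[a])
    return unnormalized / unnormalized.sum()

b0 = np.array([0.5, 0.5])          # uniform initial belief
b1 = belief_update(b0, a=0, o=0)   # belief after taking action 0, observing 0
```

The set of beliefs reachable from the initial belief under all action-observation sequences is exactly the space that point-based methods sample from.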
McGill University, Montreal, Quebec, Canada (email: mtabae@cs.mcgill.ca, dprecup@cs.mcgill.ca)

POMDPs have gained a lot of attention in the AI and operations research communities, and several planning algorithms have been developed. However, the best exact algorithms for POMDPs can be very inefficient in terms of both space and time requirements. Therefore, a huge research effort has been devoted to developing approximation techniques in this field. Most planning algorithms attempt to estimate values for belief states, i.e. probability distributions over the hidden states of the system. Recent research has been devoted to algorithms that take advantage of the fact that for most POMDP problems, a large part of the belief space is never experienced by the agent. Such approaches, known as point-based methods, consider only a finite set of belief points and compute values for the different actions only at these points. The generalization over the entire simplex is based on the assumption that "nearby" points (in terms of the L1 norm) will have close values. This assumption is justified by the fact that the optimal value function is a piecewise linear and convex function over the continuous belief space. Point-based value iteration methods have been used very successfully to solve problems which are orders of magnitude larger than classical POMDP problems. The PBVI algorithm performs point-based updates on a small set of reachable points. The error of the approximation is provably bounded, and it can be decreased by expanding the set of beliefs. However, value improvement depends to a large extent on which belief points are added to this set. Hence, the choice of belief points is a crucial problem in point-based value iteration, especially when dealing with large problems, and has been discussed by several authors. Spaan and Vlassis [8] explored the use of a large set of randomly generated reachable points. Pineau et al.
discussed several heuristics for sampling reachable belief states. Smith and Simmons [5] designed a heuristic search value iteration algorithm which maintains an upper and a lower bound on the value function to guide the search for good beliefs.

In this paper we address the issue of dynamically generating, in an efficient way, a good ordering of the beliefs that should be considered. We explore the point-based value iteration algorithm in combination with a number of belief point selection heuristics. First, we make some corrections to the reachability metric proposed by Smith and Simmons [6], which were discovered via private communications with the authors. This metric is designed to give higher selection priority to points that are reachable in the near future. The intuition is that in discounted reward problems, belief points that are only reachable after many time steps do not play an important part in the computation of the value function approximation, so we can ignore them. We compare this metric to the 1-norm distance metric previously suggested by Pineau et al. [4] and study its applicability to the point-based value iteration algorithm.

Our study also points out the fundamental exploration versus exploitation dilemma that appears throughout decision theory, in the context of sampling the reachable beliefs for which to do value backups. We propose and investigate a new strategy for point selection in PBVI. The main idea is to give priority to beliefs that are reachable in the nearer future, while still considering the distance of the candidate point to the current set. This way, we avoid the overwhelming complexity of considering all reachable belief states in a breadth-first manner, but at the same time we try to pick points that can provide a better approximation.
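The flavor of distance-based point selection can be sketched as follows. This is a hedged illustration rather than the exact PBVI expansion step: the candidate one-step-reachable beliefs are supplied by a hypothetical `successors` function, and each belief in the current set contributes its successor that lies farthest, in L1 distance, from the set:

```python
import numpy as np

def l1_dist_to_set(b, B):
    # Distance from a candidate belief to the nearest point in the set B.
    return min(np.abs(b - bp).sum() for bp in B)

def expand(B, successors):
    """Grow the belief set: for each b in B, keep the one-step successor
    farthest (in L1 distance) from the current set, if it adds coverage."""
    new_points = []
    for b in B:
        candidates = list(successors(b))
        best = max(candidates, key=lambda c: l1_dist_to_set(c, B))
        if l1_dist_to_set(best, B) > 0:
            new_points.append(best)
    return B + new_points

# Illustrative successors (in practice these come from simulating each
# action and sampling an observation, then applying the belief update).
def successors(b):
    return [np.array([0.9, 0.1]), np.array([0.6, 0.4])]

B = [np.array([0.5, 0.5])]
B_expanded = expand(B, successors)   # keeps the farther successor [0.9, 0.1]
```

A reachability-weighted variant of this selection would additionally discount candidates by how many steps into the future they first become reachable, which is the tradeoff studied in this paper.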
This is motivated by the observation that the complexity of the optimal value function can be inferred, to some extent, from the difference between the number of belief states being backed up and the number of alpha vectors representing the current approximate value function. Whenever this difference