B²RTDP: An efficient solution for Bounded-Parameter Markov Decision Process

Fernando L. Fussuma and Karina Valdivia Delgado
School of Arts, Sciences and Humanities, University of Sao Paulo, Sao Paulo - Brazil
Email: fernandofussuma@gmail.com and kvd@usp.br

Leliane Nunes de Barros
Department of Computer Science, IME, University of Sao Paulo, Sao Paulo - Brazil
Email: leliane@ime.usp.br

Abstract—A bounded-parameter Markov decision process (BMDP) can be used to model sequential decision problems in which the transition probabilities are not completely known and are given by intervals. One of the criteria used to solve this kind of problem is maximin, i.e., choosing the best action under the worst scenario. The algorithms that solve BMDPs under this criterion include interval value iteration and an extension of real-time dynamic programming (Robust-LRTDP). In this paper, we introduce a new algorithm, named B²RTDP, also based on real-time dynamic programming, which makes a different choice of the next state to be visited using upper and lower bounds of the optimal value function. An empirical evaluation shows that it converges faster than the state-of-the-art algorithms that solve BMDPs.

I. INTRODUCTION

Markov Decision Processes (MDPs) [1] provide a mathematical framework for modeling sequential decision making, for example in the areas of planning, operations research and robotics. An MDP models the interaction between an agent and its environment over t stages-to-go. At each stage t, the robot or software agent chooses an action with probabilistic effects and performs it, producing a future state and a reward. The goal of the agent is to maximize a value function, which can be the expected sum of discounted rewards over a sequence of choices. Several dynamic programming algorithms have been proposed to solve MDPs.
One classic algorithm is value iteration [1], which updates the value of all states at each iteration. When the initial state is known, we can use a more efficient solution named Real-Time Dynamic Programming (RTDP) [2] and its extensions: Labeled Real-Time Dynamic Programming (LRTDP) [3], Bounded Real-Time Dynamic Programming (BRTDP) [4], Focused Real-Time Dynamic Programming [5] and Bayesian Real-Time Dynamic Programming [6].

However, known extensions of MDPs are better suited to represent practical problems of real interest. In particular, an MDP whose probabilities are not completely known and where constraints over the probabilities are defined is called a Markov Decision Process with Imprecise Probabilities (MDP-IP), a model proposed in the 1970s [7]. A particular case of an MDP-IP, proposed in the late 1990s, is the Bounded-parameter Markov Decision Process (BMDP) [8]. A BMDP is an MDP in which the transition probabilities and rewards are defined by intervals. Since the transitions in both of these problems are imprecise, there are infinitely many probability models to choose from.

There are various criteria to evaluate a policy for both MDP-IPs and BMDPs. One of them is the maximin criterion, which considers the best action in the worst case. To solve an MDP-IP, the main computational bottleneck is the need to repeatedly solve optimization problems in order to consider the worst-case probabilities [9]–[11]. To solve a BMDP, we can exploit an additional piece of information: the structure of the intervals. Using that information we can avoid calls to an optimizer by applying a greedy method for choosing the probabilities. Interval value iteration [8] is an algorithm that uses this greedy method to solve BMDPs. There is also a solution for BMDPs based on an extension of LRTDP, named Robust-LRTDP [12].
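The greedy choice of worst-case probabilities can be sketched as follows. This is a minimal Python illustration (function and variable names are ours, not from the cited papers): assuming the agent maximizes value, the worst case is the distribution within the intervals that minimizes the expected value, which can be built by assigning as much probability mass as possible to the lowest-valued successor states.

```python
def worst_case_distribution(succ, lo, up, V):
    """Greedily pick, within the intervals [lo[s'], up[s']], the
    distribution over successors `succ` that minimizes the expected
    value sum_{s'} p(s') * V(s'), without calling an optimizer."""
    # start every successor at its lower bound
    p = {s: lo[s] for s in succ}
    mass = 1.0 - sum(p.values())  # probability mass still to distribute
    # give the remaining mass to the lowest-valued states first
    for s in sorted(succ, key=lambda s: V[s]):
        extra = min(up[s] - p[s], mass)
        p[s] += extra
        mass -= extra
    return p
```

For example, with successors a and b, intervals [0.2, 0.7] and [0.3, 0.6], and V(a) < V(b), the method pushes a to its upper bound 0.7 and leaves b at its lower bound 0.3. A symmetric variant (visiting states in decreasing order of V) yields the best-case distribution.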
In this work, we propose a new algorithm for solving a BMDP based on the BRTDP algorithm [4], named B²RTDP, which converges faster than Robust-LRTDP by making a better choice of the next state to be visited. In Section II we introduce the main concepts related to MDPs and present the definition of a BMDP and its existing solutions. In Section III we present a new algorithm for solving BMDPs that uses upper and lower bounds of the value function to make a better choice of the next state to be visited and to verify convergence. In Section IV we evaluate the proposed algorithm in terms of convergence time. Finally, in Section V we present the conclusions.

II. BACKGROUND

A. Markov Decision Process

Formally, an MDP is defined by the tuple M = 〈S, A, p, r, γ〉, where:

• S is a finite set of observable states;
• A is a finite set of actions that can be executed by the agent;
• p(s′|s, a) is the transition function that gives the probability that the next state is s′ given that action a was applied in state s;
• r is a function that associates a reward with each state;