Abstract—We consider the Canadian Traveler Problem (CTP), wherein an agent needs to traverse a given graph whose edges may or may not be blocked. The agent can observe the actual status of an edge only upon reaching either of its endpoints. To aid its traversal, the agent is given a prior blockage probability for each edge. The goal is to devise an algorithm that minimizes the expected traversal cost between two given nodes. Penalty-based and rollout-based algorithms have separately been shown to provide high-quality policies for CTP. In this study, we compare these two algorithmic frameworks via computational experiments on Delaunay and grid graphs, using one specific penalty-based algorithm and four rollout-based algorithms. Our results indicate that the penalty-based algorithm executes several orders of magnitude faster than the rollout-based ones while also providing better policies, suggesting that penalty-based algorithms stand as a prominent candidate for fast and effective sub-optimal solution of CTP.

Index Terms—Probabilistic path planning, Canadian Traveler Problem, penalty-based algorithm, rollout-based algorithm.

I. INTRODUCTION

The Canadian Traveler Problem (CTP) is a probabilistic path planning problem that represents a situation Canadian drivers encounter: when a driver reaches an intersection and observes that the road ahead is blocked due to heavy snow, the driver looks for another route. In the graph-theoretic analogue of this situation, an agent is given a traversability probability for each edge in a graph, and the goal is to devise a policy¹ that minimizes the expected traversal cost between given starting and termination points. CTP has applications in robot navigation in stochastic domains [1]-[3], adaptive traffic routing [4]-[6], and naval minefield countermeasures [7]-[12].
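As a minimal illustration of this setup, a CTP instance can be represented by per-edge traversal costs and blockage probabilities, from which nature draws a hidden blockage realization that the agent discovers only locally. The graph and all numbers below are hypothetical values chosen for illustration, not taken from the paper.

```python
import random

# A toy CTP instance: each undirected edge maps to
# (traversal cost, blockage probability). Values are hypothetical.
EDGES = {
    ("s", "a"): (2.0, 0.0),
    ("a", "t"): (2.0, 0.0),
    ("s", "b"): (1.0, 0.5),
    ("b", "t"): (1.0, 0.0),
}

def sample_weather(edges, rng):
    """Draw the hidden status of every edge; True means traversable.

    The agent never sees this map up front -- it learns an edge's true
    status only upon reaching one of the edge's endpoints.
    """
    return {e: rng.random() >= p for e, (_, p) in edges.items()}

weather = sample_weather(EDGES, random.Random(0))
```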
Along with practical applications, CTP has interesting theoretical properties: it can be cast as a Markov Decision Process (MDP) with exponentially many states (hence its intractability), or as a Partially Observable Markov Decision Process (POMDP) with deterministic observations. In fact, CTP belongs to an intermediate class of problems, called Deterministic POMDPs, that allow for state uncertainty while avoiding noisy observations [13], [14]. An AO*-based optimal algorithm has recently been introduced for CTP that runs several orders of magnitude faster than the classical AO* and value iteration [13]. The new algorithm, called CAO*, improves upon AO* via two key features: (1) a caching mechanism that avoids re-expanding visited states, and (2) dynamic upper and lower bounds at the node level for further state-space pruning. Optimal algorithms for special cases of CTP have also been studied [14], [15], and approximation algorithms and heuristics for CTP have been introduced as well [16]-[18]. In this context, Eyerich et al. [19] made a significant contribution by introducing and evaluating sampling-based (also known as rollout-based) probabilistic algorithms for CTP on both theoretical and empirical fronts. Although they show that a new UCT-based [20] rollout algorithm (called Optimistic UCT) converges to the global optimum, a major limitation of rollout-based approaches in general is that they do not scale well to large instances in terms of execution time.

Manuscript received December 9, 2014; revised February 26, 2015. This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK), Grant Nos. 111M541 and 113M489. The authors are with the Department of Industrial Engineering, Istanbul Sehir Univ., Istanbul, 34662, Turkey (e-mail: furkansahin@std.sehir.edu.tr, aksakalli@sehir.edu.tr).

¹ The terms solution and policy shall be used interchangeably in this manuscript.
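To make the rollout idea concrete, the sketch below estimates a fixed policy's expected traversal cost by sampling edge-blockage realizations ("weathers") and simulating the policy on each. This is a minimal Monte-Carlo illustration only; the tiny diamond instance, its costs, and its probabilities are made-up values, and full rollout-based CTP algorithms such as Optimistic UCT use far more sophisticated search on top of such sampling.

```python
import random

# Monte-Carlo (rollout) estimate of a policy's expected cost on a toy
# diamond-shaped CTP instance. All values are illustrative.
COST = {("s", "b"): 1.0, ("b", "t"): 1.0, ("s", "a"): 2.0, ("a", "t"): 2.0}
BLOCK_P = {("s", "b"): 0.5, ("b", "t"): 0.0, ("s", "a"): 0.0, ("a", "t"): 0.0}

def simulate_once(rng):
    """One rollout of a fixed contingency policy: head for the cheap but
    risky route s-b-t; if edge (s, b) turns out to be blocked (which the
    agent learns while standing at s), fall back to the safe route s-a-t."""
    if rng.random() < BLOCK_P[("s", "b")]:
        return COST[("s", "a")] + COST[("a", "t")]  # detour, cost 4
    return COST[("s", "b")] + COST[("b", "t")]      # direct, cost 2

def rollout_estimate(n_rollouts, seed=0):
    """Average cost over many sampled weathers; converges to the policy's
    expected cost (here 0.5 * 2 + 0.5 * 4 = 3) as n_rollouts grows."""
    rng = random.Random(seed)
    return sum(simulate_once(rng) for _ in range(n_rollouts)) / n_rollouts
```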
Hence, the need arises for CTP algorithms that are both efficient and effective. A penalty-based algorithm for CTP generalizes the well-known optimism approach by incorporating into the agent's traversal a penalty term that discourages the agent from traversing edges that are farther away from the termination and/or edges that have high blockage probability. In particular, a penalty-based algorithm calls for successive execution of a deterministic shortest-path algorithm with respect to a particular penalty function until the agent arrives at the termination. One particular penalty-based algorithm, called the Distance-to-Termination (DT) Algorithm, was evaluated using CAO* as a benchmark and was shown to find high-quality policies in very short execution times [21]. One attractive feature of penalty-based algorithms is that they scale quite well with problem size relative to rollout-based approaches.

Our goal in this study is to compare the penalty-based DT Algorithm against four rollout-based algorithms, in terms of both execution time and solution quality, on random CTP instances defined on Delaunay and grid graphs. Our purpose is to assess the relative merits of these two algorithmic frameworks on an empirical basis. The rest of this manuscript is organized as follows: Section II is devoted to the formal definition of CTP, Section III describes the penalty and rollout-based algorithms, and Section IV presents the computational experiments, which are followed by a summary and our conclusions.

O. Furkan Sahin and Vural Aksakalli, "A Comparison of Penalty and Rollout-Based Algorithms for the Canadian Traveler Problem," International Journal of Machine Learning and Computing, Vol. 5, No. 4, August 2015, p. 319. DOI: 10.7763/IJMLC.2015.V5.527

II. CTP FORMULATION

Let G = (V, E) be an undirected graph. An agent wishes to travel from a given starting node to a given termination node at minimum expected cost.
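The successive shortest-path scheme of penalty-based algorithms can be sketched as follows: repeatedly solve a deterministic shortest-path problem with penalized edge weights over the edges not yet known to be blocked, move one edge along the resulting path, observe the status of edges incident to the new location, and replan. This is a minimal illustration under stated assumptions: the penalty form w + beta * p * (distance to termination) is an assumed choice loosely in the spirit of the DT Algorithm, not necessarily the exact penalty function of [21], the diamond instance is hypothetical, and the sketch assumes the instance remains solvable.

```python
import heapq

def dijkstra(adj, source):
    """Standard Dijkstra: shortest distances and predecessors from source."""
    dist, prev = {source: 0.0}, {}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def first_step(prev, s, t):
    """First node after s on the s-t shortest path encoded in `prev`."""
    node = t
    while prev[node] != s:
        node = prev[node]
    return node

def penalty_traverse(edges, block_p, weather, s, t, beta=1.0):
    """Penalty-based replanning loop (illustrative, not the exact DT
    Algorithm of [21]): plan with penalized weights, move one edge,
    observe incident edges, and replan until reaching the termination."""
    base = {}
    for (u, v), w in edges.items():
        base.setdefault(u, []).append((v, w))
        base.setdefault(v, []).append((u, w))
    d_term, _ = dijkstra(base, t)  # distance-to-termination, base graph
    known_blocked, loc, total = set(), s, 0.0
    while loc != t:
        # Observe the true status of edges incident to the current node.
        for e in edges:
            if loc in e and not weather[e]:
                known_blocked.add(e)
        # Penalized adjacency over edges not known to be blocked.
        adj = {}
        for (u, v), w in edges.items():
            if (u, v) in known_blocked:
                continue
            pen = w + beta * block_p[(u, v)] * min(d_term[u], d_term[v])
            adj.setdefault(u, []).append((v, pen))
            adj.setdefault(v, []).append((u, pen))
        _, prev = dijkstra(adj, loc)
        nxt = first_step(prev, loc, t)
        total += edges.get((loc, nxt), edges.get((nxt, loc)))
        loc = nxt
    return total
```

Note that only the planning step uses penalized weights; the cost actually incurred by the agent is the true edge cost, which is why the returned total reflects the traversal itself rather than the penalty function.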