Monte Carlo Tree Search with Useful Cycles for Motion Planning

Bilal Kartal

Abstract— An autonomous robot team can be employed for continuous, strategic coverage of arbitrary environments across different missions. Stochastic multi-robot planning is a powerful technique for several problem domains. Many planning problems, e.g. swarm robotics, coverage planning, and multi-robot patrolling, require a high degree of coordination, which raises scalability issues for traditional joint-space planners. The other main challenge for traditional joint-space planners is the exploration versus exploitation trade-off during policy search. This dilemma is well studied in the context of the multi-armed bandit problem, and stochastic sampling-based planners employ multi-armed bandit theory to address both challenges. In this work, we investigate stochastic tree search approaches in policy space for multi-robot patrolling problems. We propose a new variant of the Monte Carlo Tree Search algorithm that produces life-long policies by exploiting periodic trajectories of the robot team.

I. INTRODUCTION

Multi-robot systems are nowadays commonly used to perform critical tasks, such as search and rescue operations, intelligent farming, mine sweeping, and environmental monitoring [9], [11], [12]. All of these problems require coverage of environments based on some optimization criterion. One instance of these coverage planning problems is the multi-robot patrolling problem, where multiple robots must cover patrol terrain strategically in a coordinated fashion to prevent intrusions. The multi-robot patrolling problem has been an active research area within the last decade, especially as more and more autonomous robots become available for surveillance tasks at low cost. One common problem formulation is an optimization-based one minimizing the idleness, i.e.
the maximum time difference between any two visits to any node; this problem is NP-hard, as the well-known TSP can be reduced to it. Recent works include extending the longevity of the patrolling task by considering robot batteries [1], decentralized patrolling based on Gaussian processes [7], and patrolling in case of coordinated attacks [10].

II. CONTRIBUTIONS

In this section, we present our current contributions and ongoing research. We study the multi-robot patrolling problem for two types of intruder models, i.e. probabilistic static and dynamic intruders. We formulate the patrolling policy generation problem as a tree search problem, and as a baseline we employ Monte Carlo Tree Search (MCTS) [5]. MCTS is a breakthrough algorithm, particularly in the AI community, and it has also been applied to several other domains [4], [6], [8]. MCTS can successfully search in large domains by using random sampling. The algorithm is anytime and converges to optimal solutions given enough time and memory for finite-horizon problems.

The author is with the Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA. {bilal}@cs.umn.edu

Fig. 1. MCTS-UC patrol policy is employed by 2 patrollers. Both patrollers start at the top left cell, but the grid is emergently partitioned into two cycles and covered continuously by the patrollers.

A. Monte Carlo Tree Search with Useful Cycles

One of the main challenges in adapting MCTS to the patrolling domain is generating infinite-length policies: the policies generated by MCTS are valid only for a small time horizon, while the patrolling task has to be performed continuously. This is an important difference from one-time coverage problems.
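As background for the baseline, the multi-armed bandit treatment of exploration versus exploitation in MCTS is commonly realized via the UCT selection rule (UCB1 applied to tree nodes). The sketch below is illustrative only; the node fields and exploration constant are our assumptions, not the paper's implementation.

```python
import math

# Illustrative UCT selection sketch (UCB1 applied to a search tree).
# Node fields (visits, value, children) and the exploration constant C
# are assumptions for exposition, not the authors' code.

C = math.sqrt(2)  # exploration constant trading off exploration vs. exploitation

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.visits = 0     # number of rollouts through this node
        self.value = 0.0    # cumulative reward from those rollouts

    def ucb1(self):
        if self.visits == 0:
            return float("inf")  # unvisited arms are always tried first
        exploit = self.value / self.visits
        explore = C * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def select(node):
    """Descend the tree, always picking the child with the highest UCB1 score."""
    while node.children:
        node = max(node.children, key=lambda c: c.ucb1())
    return node
```

With equal visit counts the exploration terms cancel, so selection favors the child with the higher empirical mean reward, while unvisited children are expanded immediately.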
We address the continuous patrolling challenge by introducing Monte Carlo Tree Search with Useful Cycles (MCTS-UC), which augments standard MCTS with cyclic nodes to return infinite, cyclic policies [2] with convergence guarantees for finite-horizon problems. We present an overview of our main contribution, MCTS-UC, in Figure 2. We define a useful cycle as a set of patrolling trajectories that starts and ends at the same vertex set for the robot team. Thus, we exploit the spatial similarity of the patrollers' visited vertices, i.e. whether or not the same set of vertices is occupied at any two states, to determine a useful cycle. In terms of search tree structure, MCTS-UC creates artificial cyclic nodes that represent continuous policies; these nodes take part in the tree search during exploration-exploitation. Consider, for example, two equivalent nodes A and B as shown in Fig. 2(a). Given these nodes, a cyclic node, node C, is created as a sibling arm to node B and its cyclic parent
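The useful-cycle test described above, matching joint states by the set of vertices the team occupies, can be sketched as follows. The function name and the representation of a joint state as a tuple of patroller vertices are hypothetical choices for illustration.

```python
# Hypothetical sketch of the useful-cycle test: a cycle is "useful" when the
# robot team returns to a joint vertex set it has occupied before, so the
# trajectory segment between the two matching states can repeat forever.

def find_useful_cycle(trajectory):
    """trajectory: list of joint states, each a tuple of patroller vertices.

    Returns (i, j) such that the vertex sets at steps i and j match,
    i.e. the segment i..j forms a candidate cyclic policy; None otherwise.
    """
    seen = {}  # vertex set -> earliest step at which it occurred
    for step, joint_state in enumerate(trajectory):
        # Spatial similarity: only the SET of occupied vertices matters,
        # not which patroller stands where.
        key = frozenset(joint_state)
        if key in seen:
            return seen[key], step
        seen[key] = step
    return None
```

For example, a two-patroller trajectory [(0, 5), (1, 4), (2, 3), (4, 1)] revisits the vertex set {1, 4} at step 3, so the segment from step 1 to step 3 would be flagged as a useful cycle; note that (1, 4) and (4, 1) match because the patrollers are interchangeable under this similarity notion.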