Planning Time to Think: Metareasoning for On-Line Planning with Durative Actions

Bence Cserna, Wheeler Ruml
University of New Hampshire
Durham, NH 03824 USA
{bence,ruml}@cs.unh.edu

Jeremy Frank
NASA Ames Research Center
Moffett Field, CA 94035 USA
jeremy.d.frank@nasa.gov

Abstract

When minimizing makespan during off-line planning, the fastest action sequence to reach a particular state is, by definition, preferred. When trying to reach a goal quickly in on-line planning, previous work has inherited that assumption: the faster of two paths that both reach the same state is usually considered to dominate the slower one. In this short paper, we point out that, when planning happens concurrently with execution, selecting a slower action can allow additional time for planning, leading to better plans. We present Slo'RTS, a metareasoning planning algorithm that estimates whether the expected improvement in future decision-making from this increased planning time is enough to make up for the increased duration of the selected action. Using simple benchmarks, we show that Slo'RTS can yield shorter time-to-goal than a conventional planner. This generalizes previous work on metareasoning in on-line planning and highlights the inherent uncertainty present in an on-line setting.

Introduction

Traditionally, planning has been considered from an off-line perspective, in which plan synthesis is completed before plan execution begins. In that setting, if the objective is to minimize plan makespan, then it is clearly advantageous to return a faster plan to achieve the goal. For example, in a heuristic search-based approach to planning, if the planner discovers two alternative plans for achieving the same state, it only needs to retain the faster of the two. In A* search, this corresponds to the usual practice of retaining only the copy of a duplicate state that has the lower g value.
However, many applications of planning demand an on-line approach, in which the objective is to achieve a goal as quickly as possible, and planning takes place concurrently with execution. For example, while the agent is transitioning from state s1 to state s2, the planner can decide on the action to execute at s2. In this way, the agent's choice of trajectory unfolds over time during execution, rather than being completely pre-planned before execution begins. While this may result in a trajectory that is longer than an off-line optimal one, it can result in achieving the goal faster than off-line planning because the planning and execution are concurrent (Kiesel, Burns, and Ruml 2015; Cserna et al. 2016). This setting also models situations in which an agent's goals can be updated during execution, requiring on-line replanning.

The first contribution of this paper is to point out that the on-line setting differs from the off-line one in that it may be advantageous for the planner to select a slower action to execute at s2 even when a faster one is known to reach the same resulting state s3. This is because the longer action will give the agent more time to plan before reaching s3. This might result in a better decision at s3, allowing the agent to reach a goal sooner. If the decision is substantially better, the difference may even be large enough to offset the delay due to the slower action. Anyone who has slowed down while driving on a highway in order to have more time to study a map before passing a crucial exit is intuitively familiar with this scenario. We generalize this reasoning to cover actions that do not immediately lead to the same state s3.

The second contribution of this paper is a practical on-line planning algorithm, Slo'RTS (pronounced Slow-are-tee-ess), that takes this observation into account.
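The trade-off described above can be illustrated with a toy model. The sketch below is purely hypothetical and is not the Slo'RTS algorithm itself: it assumes that each unit of planning time lets the agent examine one more candidate plan for the remainder of the route, and that the agent commits to the best plan examined so far. All identifiers here are invented for illustration.

```python
def expected_remaining(planning_time, discovered):
    """Toy model: with t units of planning we will have examined the
    first int(t) candidate plans (at least one) in discovery order,
    and we commit to the best (shortest) of them."""
    k = max(1, min(len(discovered), int(planning_time)))
    return min(discovered[:k])

def choose_action(fast_dur, slow_dur, discovered):
    """Pick the action whose duration plus the expected remaining
    time-to-goal from s3 is smallest.  Both actions reach s3; the
    slower one simply buys more planning time on the way."""
    total_fast = fast_dur + expected_remaining(fast_dur, discovered)
    total_slow = slow_dur + expected_remaining(slow_dur, discovered)
    return "slow" if total_slow < total_fast else "fast"

# The slower action (3s vs 1s) wins when the extra planning time
# uncovers a much better continuation at s3:
print(choose_action(1.0, 3.0, [10.0, 9.5, 4.0]))   # -> "slow"
# ...and loses when extra planning cannot improve the plan:
print(choose_action(1.0, 3.0, [5.0, 5.0, 5.0]))    # -> "fast"
```

In the first call, taking the slow action yields 3.0 + 4.0 = 7.0 total expected time versus 1.0 + 10.0 = 11.0 for the fast action, so the delay pays for itself; this is exactly the offset condition the paper's metareasoning must estimate, under uncertainty, at planning time.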
We work in the paradigm of forward state-space search, using real-time heuristic search algorithms that perform limited lookahead search and then use the lookahead frontier to inform action selection. When operating in a domain that has durative actions, whose execution times can be different, Slo'RTS takes actions' durations into account, estimating the effect on decision-making at future states. In this way, Slo'RTS reasons about its own behavior; in other words, it engages in metareasoning. We implement and test Slo'RTS in some simple gridworld benchmarks, finding that its metareasoning can indeed result in better agent behavior. To our knowledge, this is the first example of a planning algorithm that can dynamically plan to give itself more time to think without assuming that the world is static. More generally, this work is part of a recent resurgence of interest in metareasoning in heuristic search, illustrating how this beautiful idea can yield practical benefits.

Previous Work

We briefly review the real-time heuristic search and metareasoning algorithms that Slo'RTS is based on. Given our objective to minimize time to goal, we will assume that plan cost represents makespan.

Proceedings of the Twenty-Seventh International Conference on Automated Planning and Scheduling (ICAPS 2017)
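The real-time search paradigm referred to above can be sketched as follows. This is a minimal, generic illustration of one planning iteration in the style of bounded-lookahead real-time search (expand a fixed budget of nodes, then commit to the first action toward the best frontier node by f = g + h); it is not taken from the Slo'RTS paper, and all names are illustrative.

```python
import heapq
from itertools import count

def realtime_step(start, successors, h, budget):
    """One real-time search iteration: expand up to `budget` nodes
    from `start`, then return the first action on the path to the
    best frontier node, ranked by f = g + h.

    `successors(state)` yields (action, next_state, cost) triples;
    `h` is an admissible-ish heuristic estimate of cost-to-goal."""
    tie = count()  # unique tiebreaker so heap never compares states
    frontier = [(h(start), 0.0, next(tie), start, None)]
    best_g = {start: 0.0}
    expanded = 0
    while frontier and expanded < budget:
        f, g, _, state, first = heapq.heappop(frontier)
        if g > best_g.get(state, float("inf")):
            continue  # stale duplicate entry; a cheaper path was found
        expanded += 1
        for action, nxt, cost in successors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                # remember the first action that led toward this node
                heapq.heappush(frontier, (g2 + h(nxt), g2, next(tie), nxt,
                                          first if first is not None else action))
    if not frontier:
        return None  # search exhausted without a frontier to aim at
    return min(frontier)[4]

# Tiny 1-D example: states 0..4, goal at 4, unit-cost moves.
def succ(s):
    acts = [("right", s + 1, 1.0)]
    if s > 0:
        acts.append(("left", s - 1, 1.0))
    return acts

print(realtime_step(0, succ, lambda s: 4 - s, budget=3))  # -> "right"
```

Slo'RTS extends this basic loop: because actions are durative, the duration of the committed action determines how large the next iteration's lookahead budget can be, which is precisely the quantity its metareasoning weighs.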