arXiv:2006.11561v1 [cs.LG] 20 Jun 2020

Adversarial Stochastic Shortest Path

Aviv Rosenberg*    Yishay Mansour†

June 23, 2020

Abstract

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state at minimum total expected cost. In this paper we consider adversarial SSPs, which also account for adversarial changes in the costs over time, while the dynamics (i.e., the transition function) remain unchanged. Formally, an agent interacts with an SSP environment for K episodes, the cost function changes arbitrarily between episodes, and the fixed dynamics are unknown to the agent. We give high-probability regret bounds of O(√K) assuming all costs are strictly positive, and O(K^{3/4}) for the general case. To the best of our knowledge, we are the first to consider this natural setting of adversarial SSP and to obtain sublinear regret for it.

1 Introduction

Stochastic shortest path (SSP) is one of the most basic models in reinforcement learning (RL). In SSP the goal of the agent is to reach a predefined goal state at minimum expected cost, and the model captures a wide variety of natural scenarios, such as car navigation and game playing. An important aspect that the SSP model fails to capture is change in the environment over time (for example, changes in traffic when navigating a car). This aspect is theoretically modeled by adversarial Markov decision processes (MDPs), in which the cost function may change arbitrarily over time, while the transition function is still assumed to be fixed. In this work we present the adversarial SSP model, which combines SSPs with adversarial MDPs: the agent interacts with an SSP environment for K episodes, but the cost function changes arbitrarily between episodes.
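The interaction model just described can be sketched as a simple protocol: in each of the K episodes the adversary fixes a cost function, and the agent acts under the fixed (and, in the general setting, unknown) transition function until it reaches the goal. The toy environment, agent, and adversary classes below are hypothetical stand-ins for illustration only, not the paper's algorithm.

```python
class ChainEnv:
    """Toy 3-state chain: 0 -> 1 -> 2 (goal). Dynamics are fixed across episodes."""
    goal = 2

    def reset(self):
        return 0  # every episode starts at state 0

    def step(self, state, action):
        return state + 1  # the single action moves one step toward the goal


class GreedyAgent:
    """Trivial agent with one action; a real learner would update from feedback."""
    def act(self, state):
        return 0

    def observe(self, state, action, cost, next_state):
        pass  # a learning agent would use this signal

    def end_episode(self):
        pass


class Adversary:
    """Per-episode cost functions that may change arbitrarily (here: alternating)."""
    def choose_costs(self, k):
        unit = 1.0 if k % 2 == 0 else 2.0
        return lambda s, a: unit


def run_adversarial_ssp(env, agent, adversary, K):
    """Run K episodes; each episode ends only when the goal state is reached."""
    total_cost = 0.0
    for k in range(K):
        cost_fn = adversary.choose_costs(k)  # costs fixed for this episode
        state = env.reset()
        while state != env.goal:
            action = agent.act(state)
            next_state = env.step(state, action)
            c = cost_fn(state, action)
            total_cost += c
            agent.observe(state, action, c, next_state)
            state = next_state
        agent.end_episode()
    return total_cost
```

Note that, unlike finite-horizon episodic MDPs, an episode here has no fixed length: it runs until the goal is reached, which is why controlling the time to reach the goal matters for the analysis.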
The agent’s objective is to reach the goal state in every episode while minimizing its total expected cost, and its performance is measured by the regret: the difference between the agent’s total cost and the total expected cost of the best stationary policy in hindsight. We propose the first algorithms for regret minimization in adversarial SSPs. Our algorithms take recent advances in learning SSP problems [1, 2], which build upon the optimism-in-the-face-of-uncertainty principle, and combine them with the O-REPS framework [3, 4, 5, 6] for adversarial episodic MDPs, which implements the online mirror descent (OMD) algorithm for online convex optimization.

We follow the strategy of [1, 2] for SSPs: we start by assuming all costs are strictly positive and prove O(√K) regret (which is optimal). Then, using a perturbation argument, we remove this assumption and show that our algorithms obtain O(K^{3/4}) regret.

First, we consider a simplified case in which the transition function is known to the learner and the regret is to be minimized in expectation. For this case, we establish an efficient O-REPS-based algorithm and bound its expected regret. Then, we introduce an improvement that ensures the learner does not run too long before reaching the goal, and show that this yields a high-probability regret bound. Finally, we remove the known-transition-function assumption and combine our algorithm with the confidence-set framework of UCRL2 [7]. This allows us to prove a high-probability regret bound without knowledge of the transition function.

* Tel Aviv University, Israel; avivros007@gmail.com.
† Tel Aviv University, Israel, and Google Research, Tel Aviv; mansour.yishay@gmail.com.
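As a concrete illustration of the regret measure defined above, the sketch below computes it from per-episode totals. The function name and the toy numbers are hypothetical; in practice the benchmark term is the expected cost of the single best-in-hindsight stationary policy under each episode's cost function, which the learner does not observe directly.

```python
def empirical_regret(agent_costs, best_policy_expected_costs):
    """Regret over K episodes: the agent's total cost minus the total
    expected cost of the best stationary policy in hindsight.

    agent_costs: length-K sequence of total costs the agent actually paid.
    best_policy_expected_costs: length-K sequence of the best fixed
        policy's expected cost under each episode's cost function.
    """
    return float(sum(agent_costs) - sum(best_policy_expected_costs))


# Toy illustration with made-up numbers, K = 4 episodes:
agent = [3.0, 2.5, 4.0, 3.5]  # costs the agent incurred per episode
best = [2.0, 2.0, 3.0, 2.5]   # expected per-episode costs of the best fixed policy
print(empirical_regret(agent, best))  # prints 3.5
```

A sublinear regret bound such as O(√K) means this quantity grows slower than K, so the agent's average per-episode cost approaches that of the best stationary policy in hindsight.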