Accelerating Online Reinforcement Learning via Supervisory Safety Systems

Benjamin Evans 1, Johannes Betz 2, Hongrui Zheng 2, Herman A. Engelbrecht 1, Rahul Mangharam 2, and Hendrik W. Jordaan 1

Abstract— Deep reinforcement learning (DRL) is a promising method for learning robot control policies from demonstration and experience alone. To cover the whole dynamic behaviour of the robot, DRL training is an active exploration process, typically carried out in simulation environments. Although simulation training is cheap and fast, applying DRL algorithms to real-world settings is difficult: agents trained until they perform safely in simulation still transfer poorly to physical systems because of the sim-to-real gap, the difference between the simulation dynamics and the physical robot. In this paper, we present a method for training a DRL agent online on a physical vehicle by using a model-based safety supervisor. Our solution uses a supervisory system to check whether the action selected by the agent is safe or unsafe and to ensure that a safe action is always implemented on the vehicle. With this, we bypass the sim-to-real problem while training the DRL algorithm safely, quickly, and efficiently. We provide a variety of real-world experiments in which we train a small-scale physical vehicle online to drive autonomously with no prior simulation training. The evaluation results show that our method trains agents with improved sample efficiency while never crashing, and that the trained agents demonstrate better driving performance than those trained in simulation.

I. INTRODUCTION

A. Motivation

Deep reinforcement learning (DRL) is a growing, popular method in autonomous system control [1]. Like humans, who learn from experience over time, DRL algorithms learn control mappings from sensor readings to planning commands using only observations from the environment and reward signals defined by the engineer.
In contrast to humans, who learn in the real world, DRL agents are usually trained in simulation. These simulation environments require accurate sensor and dynamics models to represent the robot and its surroundings. Unfortunately, the accuracy of simulation environments is limited to maintain good computation time, resulting in the sim-to-real gap when a simulation-trained DRL agent is transferred to a real-world system [2]. It is therefore desirable to train an agent directly on the robot, avoiding the sim-to-real gap altogether [3]. An inherent challenge in the online training of DRL algorithms on real-world robots is that DRL algorithms rely on crashing during training, meaning that training on a physical robot is very difficult or nearly impossible [4]. Crashing physical robots is expensive and a safety concern for surrounding humans [5]. Therefore, being able to train DRL agents safely and crash-free onboard physical robots would enable the application of DRL to more physical platforms. Further, it can be expected that bypassing the sim-to-real gap will lead to improved DRL policies.

1 B. Evans, H.A. Engelbrecht, and H.W. Jordaan are with the Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa (e-mail: bdevans, hebecht, wjordaan@sun.ac.za)
2 J. Betz, H. Zheng, and R. Mangharam are with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA (e-mail: joebetz, hongruiz, rahulm@seas.upenn.edu)

B. Contributions

In this paper, we provide insights on online DRL training and testing on a real-world vehicle, accelerated by a supervisory system. We present a supervisory safety system (SSS) capable of training a DRL agent onboard a physical autonomous vehicle with no prior simulation training. The supervisory safety system uses a viability kernel (a set of safe states) and a vehicle model to check whether the DRL agent's action is safe.
If the DRL action is unsafe, a safe action from a pure pursuit controller is implemented instead. This work has three main contributions:

- We combine a supervisory system and a DRL algorithm to achieve safe and efficient policy training.
- We demonstrate that combining the supervisory system and the DRL algorithm results in safe and robust online training on a real-world robot system.
- We demonstrate that an agent trained with the SSS can effectively bypass the sim-to-real gap by outperforming an agent trained in simulation.

II. RELATED WORK

In this section, we discuss work related to DRL for autonomous vehicles, safe DRL training, and online DRL training.

DRL for autonomous vehicles: Many variations of DRL-based methods (model-based and model-free) have been implemented to derive control commands for autonomous vehicles from raw sensor inputs. The authors of [6], [7] used Deep Q-Learning (DQN) to learn steering manoeuvres for autonomous systems, and [8], [9] used the soft actor-critic (SAC) algorithm. Numerous deep learning studies are only evaluated in simulation because they are not practically feasible [10], [11], [12]. Of the DRL algorithms applied to physical systems, the dominant approach in the literature is to train agents in simulation before transferring them to real vehicles [13], [14], [15]. Evaluations show that DRL is

arXiv:2209.11082v1 [cs.RO] 22 Sep 2022
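The supervisory mechanism described in Section I-B (predict the next state with the vehicle model, check membership in the viability kernel, and fall back to pure pursuit if unsafe) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the 1-D dynamics, the function names, and the zero-action fallback are all assumptions made for the example.

```python
def supervise(state, agent_action, model, is_viable, fallback):
    """Pass the DRL agent's action through if the predicted next
    state lies in the viability kernel; otherwise substitute the
    fallback controller's action so a safe action is always applied."""
    next_state = model(state, agent_action)
    if is_viable(next_state):
        return agent_action        # predicted state is safe: keep the DRL action
    return fallback(state)         # predicted state is unsafe: override

# Toy 1-D example (hypothetical): the state is a position that must
# stay on a track segment [0, 10].
model = lambda s, a: s + a              # stand-in vehicle model
is_viable = lambda s: 0.0 <= s <= 10.0  # viability-kernel membership test
fallback = lambda s: 0.0                # stand-in for a pure pursuit action

supervise(5.0, 2.0, model, is_viable, fallback)  # action kept
supervise(9.0, 3.0, model, is_viable, fallback)  # action overridden
```

In a real system the model would be the vehicle dynamics, the viability check a lookup in a precomputed set of safe states, and the fallback the pure pursuit controller.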