Safety Margins for Reinforcement Learning

Alexander Grushin, Galois, Inc. (agrushin@galois.com)
Walt Woods, Galois, Inc. (waltw@galois.com)
Alvaro Velasquez, University of Colorado Boulder (alvaro.velasquez@colorado.edu)
Simon Khan, Air Force Research Laboratory (simon.khan@us.af.mil)

Abstract—Any autonomous controller will be unsafe in some situations. The ability to quantitatively identify when these unsafe situations are about to occur is crucial for drawing in timely human oversight in, e.g., freight transportation applications. In this work, we demonstrate that the true criticality of an agent’s situation can be robustly defined as the mean reduction in reward given some number of random actions. Proxy criticality metrics that are computable in real time (i.e., without actually simulating the effects of random actions) can be compared to the true criticality, and we show how to leverage these proxy metrics to generate safety margins, which directly tie the consequences of potentially incorrect actions to an anticipated loss in overall performance. We evaluate our approach on learned policies from APE-X and A3C within an Atari environment, and demonstrate how safety margins decrease as agents approach failure states. The integration of safety margins into programs for monitoring deployed agents allows for the real-time identification of potentially catastrophic situations.

I. INTRODUCTION

Broader adoption of autonomous controllers for real-world applications relies on the ability to ensure that any benefits of automation come without unacceptable costs; for example, in freight transportation, accidents result not only in damage to the autonomous vehicle, but also in loss of cargo and loss of life. Much existing work in the field focuses on improving the reliability of these controllers. We instead seek to understand and quantify when a controller might be on the brink of disaster, in order to raise an alert for timely human oversight.
During the past several years, criticality metrics have been developed for gauging the importance of any given point in time to an agent’s overall success [1]–[4]. These metrics are typically validated by injecting random or adversarially worst-case actions at times that have the top-N largest metric values; if this results in a large measured reduction in reward, then the metric is considered to be more accurate [1], [4]. Such an evaluation approach lacks ground truth (i.e., the true criticality at any specific time is unknown), and is liable to miss false negative errors, where the metric has a low value, but true criticality is high and the agent is in imminent danger. Thus, the accuracy of existing metrics is not well-established; furthermore, it can be unclear what a given metric value implies in terms of potential consequences to the agent.

Instead, we re-label these metrics as proxy criticality measurements, and introduce a definition for true criticality at some point in time t as the expected reduction in reward when an agent executes a sequence of n consecutive random actions (beginning at t), rather than the actions suggested by its policy; this is a modification of a definition given in [2]. We present an algorithm for accurately approximating the expected value as a mean reward reduction, in a tractable way (albeit not in real time). By analyzing the relationship between proxy and true criticality, we can identify both false positives and false negatives. Crucially, we show that even a noisy relationship between proxy and true criticality can provide actionable information. Specifically, we define the safety margin at some time t as the maximum number of random actions which, if executed beginning at t, have only an α chance of impairing agent performance more than some tolerance ζ, defined in the discounted reward space of the application.
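The true-criticality definition above can be illustrated with a Monte Carlo sketch: clone the environment state, roll out with n random actions before resuming the policy, and compare the mean return to the all-policy baseline. The toy `BanditChain` environment, the `clone` interface, and all parameter names below are illustrative assumptions, not the paper’s actual implementation:

```python
import random

class BanditChain:
    """Toy environment: action 1 yields reward 1, action 0 yields 0.
    An episode lasts `length` steps. Stands in for a clonable simulator."""
    def __init__(self, length=20):
        self.length = length
        self.t = 0
    def step(self, action):
        self.t += 1
        return (1.0 if action == 1 else 0.0), self.t >= self.length
    def clone(self):
        c = BanditChain(self.length)
        c.t = self.t
        return c

def rollout(env, policy, n_random, rng):
    """Total reward when the first n_random actions are random,
    with the policy resumed afterwards."""
    total, done, k = 0.0, False, 0
    while not done:
        action = rng.choice([0, 1]) if k < n_random else policy()
        k += 1
        reward, done = env.step(action)
        total += reward
    return total

def true_criticality(env, policy, n_random, n_rollouts=200, seed=0):
    """Monte Carlo estimate of true criticality at the current state:
    mean reduction in reward from n_random consecutive random actions,
    relative to following the policy throughout."""
    rng = random.Random(seed)
    baseline = rollout(env.clone(), policy, 0, rng)
    perturbed = sum(rollout(env.clone(), policy, n_random, rng)
                    for _ in range(n_rollouts)) / n_rollouts
    return baseline - perturbed

# Usage: the optimal policy always picks action 1, so each random action
# costs 0.5 reward in expectation, giving criticality near n_random * 0.5.
print(true_criticality(BanditChain(20), lambda: 1, n_random=4))
```

This captures only the estimation scheme; in an Atari setting the environment clone would come from emulator state snapshots, and the reduction would be measured in discounted reward.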
Intuitively, this can be illustrated in the game of Pong: a “safe” policy that keeps the ball centered on the agent’s paddle can afford mistakes, whereas an agent that keeps the ball near the edge of its paddle may lose a point if it makes a mistake. For proxy criticality metrics that can be computed in real time, safety margins result in a lookup table that can be consulted at any time, allowing autonomous systems to automatically flag themselves as needing human oversight in critical situations.

II. RESULTS

We ran experiments with the Atari game BeamRider, where players pilot a ship and attempt to destroy enemy ships. Using a proxy criticality metric adapted from [1], which takes the maximum predicted Q value for APE-X or action log likelihood for A3C, and subtracts the minimum such quantity, we constructed the safety margin tables shown in Fig. 1. These lookup tables provide approximately 1 − α confidence that a given, tolerable loss in performance (Y-axis) will not be exceeded.

TABLE I
PREDICTIVE CAPABILITIES OF SAFETY MARGINS

  ζ     Algorithm   Steps before death   Safety margin
  0.5   APE-X       1                    0.00 ± 0.00
                    2                    0.50 ± 0.87
                    4                    0.99 ± 0.73
                    average              0.97 ± 0.86
        A3C         1                    0.53 ± 0.50
                    2                    1.00 ± 0.00
                    4                    0.82 ± 0.38
                    average              3.09 ± 2.68
  1.0   APE-X       1                    2.00 ± 0.00
                    2                    2.50 ± 0.87
                    4                    4.50 ± 2.23
                    average              4.72 ± 2.50
        A3C         1                    2.48 ± 1.75
                    2                    4.00 ± 0.00
                    4                    3.42 ± 2.66
                    average              9.20 ± 5.24

arXiv:2307.13642v1 [cs.LG] 25 Jul 2023
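The max-minus-min proxy metric and the safety-margin lookup table can be sketched as follows. The function names, the binning scheme, and the `(proxy, n_random, reward_loss)` sample format are assumptions made for illustration; the paper describes the metric and the α/ζ criterion but not this particular data layout:

```python
import bisect
from collections import defaultdict

def proxy_criticality(action_values):
    """Max-minus-min proxy metric: spread between the best and worst
    action value (predicted Q values for APE-X, action log likelihoods
    for A3C), computable in real time from a single forward pass."""
    return float(max(action_values) - min(action_values))

def build_safety_margin_table(samples, alpha, zeta, bin_edges):
    """Offline construction of a safety-margin lookup table.

    `samples` holds (proxy_value, n_random, reward_loss) triples gathered
    by simulating n_random consecutive random actions from states with
    known proxy values. For each proxy bin, the safety margin is the
    largest n whose observed chance of losing more than zeta reward is
    at most alpha."""
    by_bin = defaultdict(lambda: defaultdict(list))
    for proxy, n, loss in samples:
        by_bin[bisect.bisect_right(bin_edges, proxy)][n].append(loss)
    table = {}
    for b, by_n in by_bin.items():
        margin = 0
        for n in sorted(by_n):
            exceed = sum(loss > zeta for loss in by_n[n]) / len(by_n[n])
            if exceed <= alpha:
                margin = n
            else:
                break  # assume larger n is no safer
        table[b] = margin
    return table
```

At deployment time, the agent would compute `proxy_criticality` each step, look up its bin in the table, and flag for human oversight whenever the margin falls below an acceptable threshold.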