Aux-AIRL: End-to-End Self-Supervised Reward Learning for Extrapolating beyond Suboptimal Demonstrations

Yuchen Cui*1, Bo Liu*1, Akanksha Saran1, Stephen Giguere1, Peter Stone12, Scott Niekum1

Abstract

Real-world human demonstrations are often suboptimal. How to extrapolate beyond suboptimal demonstrations is an important open research question. In this ongoing work, we analyze the success of a previous state-of-the-art self-supervised reward learning method that requires four sequential optimization steps, and propose a simple end-to-end imitation learning method, Aux-AIRL, that extrapolates from suboptimal demonstrations without requiring multiple optimization steps.

1. Introduction

The advent of autonomous agents in our homes and workplaces is contingent on their ability to adapt in novel, varied, and dynamic environments and learn new tasks from end-users in these environments. A natural approach is for end-users to teach learning agents by showing demonstrations of how a task should be performed, which is known as learning from demonstration (LfD) or imitation learning (Argall et al., 2009). Typically, LfD algorithms assume that users provide near-optimal demonstrations, an assumption that often does not hold, as novice end-users can provide suboptimal demonstrations. Instead of discarding suboptimal demonstrations, a recent suite of self-supervised methods (Brown et al., 2019a;b; Chen et al., 2020) has shown how to leverage this suboptimal data to learn reward functions that can induce behaviors extrapolating beyond the demonstrator's performance. Brown et al. (2019b) propose Disturbance-based Reward Extrapolation (D-REX), which bootstraps off suboptimal demonstrations to synthesize noise-injected trajectory rollouts. These synthesized trajectories, with varying levels of injected noise, are then ranked by noise level and used to train a reward function, which is in turn used with reinforcement learning (RL) (Sutton et al., 1998) to learn the final policy.
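The ranking step above can be illustrated with a minimal sketch of the pairwise Bradley-Terry loss that D-REX (following T-REX) uses to train a reward network on noise-ranked rollouts. The toy reward function and trajectory encodings below are placeholders of our own, not the authors' implementation:

```python
import numpy as np

def traj_return(reward_fn, traj):
    """Predicted return: sum of per-state rewards along a trajectory."""
    return sum(reward_fn(s) for s in traj)

def drex_ranking_loss(reward_fn, traj_noisier, traj_cleaner):
    """Bradley-Terry style ranking loss (as in T-REX/D-REX): rollouts
    generated with less injected noise are assumed to be better, so the
    cleaner trajectory should receive the higher predicted return."""
    r_i = traj_return(reward_fn, traj_noisier)
    r_j = traj_return(reward_fn, traj_cleaner)
    # -log P(cleaner preferred) = -log( exp(r_j) / (exp(r_i) + exp(r_j)) )
    return float(-(r_j - np.logaddexp(r_i, r_j)))

# Toy 1-D "states": a reward aligned with the noise ranking gives a
# small loss; an inverted reward gives a large one.
clean = [1.0, 1.0]
noisy = [0.1, 0.1]
aligned = lambda s: s
inverted = lambda s: -s
```

Minimizing this loss over many such pairs pushes the reward network to score lower-noise behavior higher, which is what allows extrapolation past the original demonstrations.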
* Equal contribution. 1Department of Computer Science, University of Texas at Austin, Austin, Texas, USA. 2Sony AI, USA. Correspondence to: Yuchen Cui <yuchencui@utexas.edu>, Bo Liu <bliu@cs.utexas.edu>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

By improving upon this approach, Chen et al. (2020) propose self-supervised reward regression (SSRR), which leverages adversarial inverse reinforcement learning (AIRL) (Fu et al., 2017) to generate synthetic demonstrations and then learns a reward function that is aware of the amount of noise injected into the policy during self-supervision. SSRR first fits the noise-performance curve with a sigmoid function, and then regresses a reward function to the resultant curve. They show that training an RL policy on this regressed, noise-aware reward function outperforms D-REX in three MuJoCo environments (Todorov et al., 2012).

In this work, we perform an in-depth study of the mechanisms behind SSRR. While we observe that most steps of SSRR are essential to its success, we also find that the criterion it uses for reward regression may not be optimal. Chen et al. (2020) demonstrate empirically that fitting a sigmoid function to the noise-performance curve yields a regression target for reward functions that induce better-than-demonstrator agents, outperforming D-REX. However, we find that the sigmoid is not the only function that achieves high extrapolation performance, and that fitting the function parameters to the AIRL reward is not a necessary step. We hypothesize that what is critical for extrapolation is a steep drop in estimated reward as soon as noise is injected. We therefore propose to directly apply an auxiliary loss to AIRL (Fu et al., 2017), which enforces trajectories without injected noise to have higher predicted rewards than trajectories with noise.
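As a purely illustrative sketch of such a separation constraint (the paper's actual objective is not specified here; the hinge form, margin value, and weighting coefficient below are all our assumptions), the auxiliary term could look like:

```python
import numpy as np

def separation_loss(ret_clean, ret_noisy, margin=1.0):
    """Hinge penalty: the predicted return of a noise-free trajectory
    should exceed that of a noise-injected one by at least `margin`
    (margin=1.0 is an arbitrary illustrative choice)."""
    return float(np.maximum(0.0, margin - (ret_clean - ret_noisy)))

def aux_airl_loss(airl_loss, ret_clean, ret_noisy, lam=0.1):
    """Hypothetical total objective: the usual AIRL discriminator loss
    plus the weighted auxiliary separation term (lam is a placeholder
    weight, not a value from the paper)."""
    return airl_loss + lam * separation_loss(ret_clean, ret_noisy)
```

When the noise-free trajectory already out-scores the noisy one by the margin, the auxiliary term vanishes and training reduces to standard AIRL, so the extra constraint is only active where the reward fails to separate the two.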
We show that this simple auxiliary objective of creating a separation between the predicted performance of policies with and without noise can extrapolate beyond suboptimal demonstrations in a more efficient manner, replacing the multi-stage training process of SSRR with a single step.

2. Background

In this section, we introduce the problem setting and give an overview of three related works.

2.1. Problem Setup

We consider sequential decision making problems modeled as Markov Decision Processes (MDPs). An MDP is given