On-Policy Imitation Learning from an Improving Supervisor

Ashwin Balakrishna*1, Brijen Thananjeyan*1, Jonathan Lee1, Arsh Zahed1, Felix Li1, Joseph E. Gonzalez1, Ken Goldberg1

*Equal contribution. 1Department of EECS, University of California, Berkeley. Correspondence to: Ashwin Balakrishna <ashwin_balakrishna@eecs.berkeley.edu>, Brijen Thananjeyan <brijen@eecs.berkeley.edu>.

Real-World Sequential Decision Making Workshop at ICML 2019. Copyright 2019 by the author(s).

Abstract

Most on-policy imitation algorithms, such as DAgger, are designed for learning with a fixed supervisor. However, there are many settings in which the supervisor improves during policy learning, such as when the supervisor is a human performing a novel task or an improving algorithmic controller. We consider learning from an "improving supervisor" and derive a bound on the static regret of online gradient descent when a converging supervisor policy is used. We present an on-policy imitation learning algorithm, Follow the Improving Teacher (FIT), which uses a deep model-based reinforcement learning (deep MBRL) algorithm as the supervisor to provide the sample-complexity benefits of model-based methods while enabling faster training and evaluation via distillation into a reactive controller. We evaluate FIT with experiments on the Reacher and Pusher MuJoCo domains using the deep MBRL algorithm PETS as the improving supervisor. To the best of our knowledge, this work is the first to formally consider the setting of an improving supervisor in on-policy imitation learning.

1. Introduction

In on-policy imitation learning, a policy is iteratively trained to match the behavior of a supervisor on a particular task on the distribution of the learned policy. In algorithms such as DAgger (Ross et al., 2011a), the supervisor serves as a labeler, providing feedback on the appropriate controls for states visited by the learner. Ross et al. (2011a) show that DAgger can be interpreted as a no-regret algorithm in the online-learning setting, and that it provides vanishing regret guarantees when the policy update step, performed via Follow The Leader (FTL), has vanishing regret (Ross et al., 2011a; Kakade & Tewari, 2009).

Prior work focuses on imitation learning algorithms with a fixed supervisor (Ross et al., 2011a; Sun et al., 2017; Lee et al., 2019; Cheng & Boots, 2018). In this work, however, we consider a convergent sequence of supervisors. This setting is motivated by practical scenarios in which the supervisor improves its task performance substantially as time progresses, e.g., as a human supervisor learns how to play a game they have never played or how to teleoperate a robot with unfamiliar controls. We investigate how initially suboptimal labeling feedback affects the static regret incurred by the learned policy. This is particularly relevant to long-time-horizon tasks, in which a large-scale system is designed to improve over time on a difficult task using human experience as feedback. We show that the regret guarantees are not significantly affected when the supervisor is initially suboptimal, as long as it converges to the desired policy.

Learning from improving supervisors also has applications to deep model-based reinforcement learning, which has attracted interest due to its improved sample efficiency compared to model-free methods (Chua et al., 2018). Recent model-based RL algorithms for continuous-control domains represent system dynamics with a deep neural network, which is updated on-policy, and use model-predictive control (MPC) to generate controls (Chua et al., 2018; Nagabandi et al., 2018). However, generating controls for dynamics models represented by deep neural networks often involves significant online computation, making it infeasible to collect high-frequency policy rollouts from the model-based controller.
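The online computation burden can be seen in a minimal random-shooting MPC sketch; the toy dynamics model, cost function, and all names below are hypothetical stand-ins for a PETS-style controller, not the actual implementation:

```python
import numpy as np

def dynamics_model(state, action):
    # Stand-in for a learned deep dynamics model; a real neural-network
    # forward pass is far more expensive than this toy update.
    return state + 0.1 * action

def random_shooting_mpc(state, horizon=20, n_candidates=1000, action_dim=2):
    """Return the first action of the lowest-cost random action sequence."""
    rng = np.random.default_rng(0)
    # Sample candidate action sequences: (n_candidates, horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_cost, best_action = np.inf, None
    for seq in candidates:
        s, cost = state, 0.0
        for a in seq:  # one model evaluation per step of the horizon
            s = dynamics_model(s, a)
            cost += np.sum(s ** 2)  # illustrative quadratic state cost
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action

# Every single control step requires n_candidates * horizon model
# evaluations (20,000 here), repeated at each timestep of a rollout.
action = random_shooting_mpc(np.ones(2))
```

With a deep network in place of `dynamics_model`, this per-step optimization is what makes high-frequency rollouts from the model-based controller impractical.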
This significantly slows down both training, which requires policy rollouts for policy evaluation, and evaluation at test time, making direct application of these techniques difficult in many robotic tasks. We focus on this setting in this work.

Motivated by the idea of learning from an improving supervisor, we present an on-policy imitation learning algorithm to train a model-based deep reinforcement learning agent using off-policy data from a model-free learner policy. The model-based supervisor is used to generate labels, which are then used to update the learner. This enables fast policy eval-