Published as a conference paper at ICLR 2020

IMPLEMENTATION MATTERS IN DEEP POLICY GRADIENTS: A CASE STUDY ON PPO AND TRPO

Logan Engstrom1*, Andrew Ilyas1*, Shibani Santurkar1, Dimitris Tsipras1, Firdaus Janoos2, Larry Rudolph1,2, and Aleksander Mądry1

1MIT  2Two Sigma
{engstrom,ailyas,shibani,tsipras,madry}@mit.edu
rudolph@csail.mit.edu, firdaus.janoos@twosigma.com

ABSTRACT

We study the roots of algorithmic progress in deep policy gradient algorithms through a case study on two popular algorithms: Proximal Policy Optimization (PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate the consequences of “code-level optimizations”: algorithm augmentations found only in implementations or described as auxiliary details to the core algorithm. Seemingly of secondary importance, such optimizations turn out to have a major impact on agent behavior. Our results show that they (a) are responsible for most of PPO’s gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function. These insights show the difficulty and importance of attributing performance gains in deep reinforcement learning.

1 INTRODUCTION

Deep reinforcement learning (RL) algorithms have fueled many of the most publicized achievements in modern machine learning (Silver et al., 2017; OpenAI, 2018; Abbeel & Schulman, 2016; Mnih et al., 2013). However, despite these accomplishments, deep RL methods are still not nearly as reliable as their (deep) supervised learning counterparts. Indeed, recent research found existing deep RL methods to be brittle (Henderson et al., 2017; Zhang et al., 2018), hard to reproduce (Henderson et al., 2017; Tucker et al., 2018), unreliable across runs (Henderson et al., 2017; 2018), and sometimes outperformed by simple baselines (Mania et al., 2018).
The prevalence of these issues points to a broader problem: we do not understand how the parts comprising deep RL algorithms impact agent training, either separately or as a whole. This unsatisfactory understanding suggests that we should re-evaluate the inner workings of our algorithms. Indeed, the overall question motivating our work is: how do the multitude of mechanisms used in deep RL training algorithms impact agent behavior?

Our contributions. We analyze the underpinnings of agent behavior—both through the traditional metric of cumulative reward, and by measuring more fine-grained algorithmic properties. As a first step, we conduct a case study of two of the most popular deep policy-gradient methods: Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017). These two methods are closely related: PPO was originally developed as a refinement of TRPO.

We find that much of the observed improvement in reward brought by PPO may come from seemingly small modifications to the core algorithm which we call code-level optimizations. These optimizations are either found only in implementations of PPO, or are described as auxiliary details and are not present in the corresponding TRPO baselines1. We pinpoint these modifications, and perform an ablation study demonstrating that they are instrumental to PPO’s performance.

* Equal contribution. Work done in part while interning at Two Sigma.
1 Note that these code-level optimizations are separate from “implementation choices” like the choice of PyTorch versus TensorFlow in that they intentionally change the training algorithm’s operation.

arXiv:2005.12729v1 [cs.LG] 25 May 2020
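To make the distinction concrete, the following is a minimal sketch (not the authors' code) contrasting PPO's published clipped surrogate objective with one representative code-level optimization: clipping the value-function loss in the same style as the policy loss. The function names here are hypothetical, and the sketch assumes scalar or NumPy-array inputs for the probability ratio, advantage estimate, and return targets.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """PPO's core clipped surrogate objective (to be maximized).

    ratio: pi_new(a|s) / pi_old(a|s); advantage: estimated advantage.
    The ratio is clipped to [1 - eps, 1 + eps], and the pessimistic
    (elementwise minimum) of the two surrogates is used.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def clipped_value_loss(values, old_values, returns, eps=0.2):
    """A code-level optimization found in PPO implementations but absent
    from the core algorithm description: the value loss is the maximum of
    the clipped and unclipped squared errors, mirroring the policy clip."""
    clipped_values = old_values + np.clip(values - old_values, -eps, eps)
    return np.maximum((values - returns) ** 2,
                      (clipped_values - returns) ** 2)
```

The first function is the objective stated in the PPO paper; the second is the kind of auxiliary modification this work ablates, since it changes the training dynamics yet typically appears only in released code.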