Online BayesSim for Combined Simulator Parameter Inference and Policy Improvement

Rafael Possas*†, Lucas Barcelos†, Rafael Oliveira†, Dieter Fox*‡, Fabio Ramos*†
*NVIDIA  †University of Sydney  ‡University of Washington

Abstract—In this paper, we study the integration of simulation parameter inference with both model-free reinforcement learning and model-based control in a novel sequential algorithm that alternates between learning a better estimate of the parameters and improving the controller. Experimental results suggest that both control strategies perform better than traditional domain randomization methods.

I. INTRODUCTION

Advancements in simulation have allowed robotics learning to become more efficient and realistic in recent years [1][2][3]. However, a range of improvements to simulation techniques is still needed before they can capture reality with all its complexities. "Reality gap" is the term used when the environment model in a simulator does not represent the target system accurately enough to achieve the desired performance when a robot is deployed in the real world.

It is known that oversimplified assumptions or insufficient numerical precision in solvers can play a major role in how well a simulator models its target system. Existing prior knowledge about simulation parameters is often incorporated through a series of trial-and-error experiments until a good approximation is reached. This process is inefficient and time-consuming, as it involves running non-optimal control strategies on expensive and fragile robots.

In this work, we build upon the idea of using probabilistic inference to learn distributions over simulation parameters [3]. This technique leverages recent advances in likelihood-free inference (LFI) [4] for Bayesian analysis to learn posteriors over simulation parameters based on rollouts obtained from the target system.
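As a self-contained illustration of the likelihood-free inference step (a minimal sketch, not the paper's implementation), the following Python snippet infers a single dynamics parameter with Rejection ABC, one of the classical LFI methods, using a toy autoregressive "simulator" and a hand-crafted summary statistic in place of the learned RNN features and mixture-of-Gaussians posterior used in this work. The simulator, the parameter, and all function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n_steps=50):
    """Toy 'simulator': a damped random walk whose damping coefficient
    theta is the unknown simulation parameter (stand-in for physics)."""
    s = np.zeros(n_steps)
    for t in range(1, n_steps):
        s[t] = theta * s[t - 1] + rng.normal(scale=0.1)
    return s

def summary(traj):
    # Hand-crafted summary statistic (lag-1 autocorrelation). The paper
    # replaces this manual step with an RNN-learned trajectory embedding.
    return np.array([np.corrcoef(traj[:-1], traj[1:])[0, 1]])

# "Real" rollout from the target system, with true theta = 0.8.
theta_true = 0.8
z_real = summary(simulate(theta_true))

# Rejection ABC: draw from the uniform prior p(theta) and keep the
# parameters whose simulated summaries land within eps of the real one.
prior_samples = rng.uniform(0.0, 1.0, size=5000)
eps = 0.05
accepted = [th for th in prior_samples
            if np.linalg.norm(summary(simulate(th)) - z_real) < eps]

posterior_mean = float(np.mean(accepted))
```

In the full method, the accept/reject step is replaced by fitting a conditional density model q_φ(θ|z), which yields a posterior from far fewer simulations than rejection sampling requires.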
Previous work [3] managed to learn distributions over parameters, but it required a reasonable initial controller that was able to explore the dynamical system in relevant regions of the state-space. In this paper, we instead propose an end-to-end approach that combines posterior updates with controller improvement.

II. ONLINE BAYESSIM

Here we present the main contribution of the paper: Online BayesSim. We leverage previous work in likelihood-free inference to simultaneously improve a controller and learn a distribution over the simulator parameters. Additionally, we propose a methodology to automate the computation of a low-dimensional representation of state-action trajectories using Recurrent Neural Networks (RNNs). The difficulty of representing high-dimensional time series has been one of the major reasons why LFI methods do not scale well to higher-dimensional spaces. We show that, with an RNN, latent representations of entire trajectories can be learnt and used directly for posterior estimation. This removes the need to manually define meaningful summary statistics, which can sometimes be a difficult and complex task.

The Bayesian inference machinery can be borrowed from more traditional statistical methods such as approximate Bayesian computation (ABC) [5]. Improvements over this method, such as Rejection ABC [6], Markov chain Monte Carlo ABC (MCMC-ABC) [7], Sequential Monte Carlo ABC (SMC-ABC) [8] and, finally, the ε-free approach [4], have enabled Bayesian inference on a wide range of problems.

Formally, we start with a stochastic controller π_β(a_t|s_t) and no prior knowledge of the true parameters, represented by a uniform prior p(θ). In the first iteration, π_β(a_t|s_t) is initialised with samples from the uniform prior p(θ). Trajectories S_s, A_s are collected using the current π_β(a_t|s_t), and are then used to update our mixture-of-Gaussians model q_φ(θ|z). New data S_r, A_r are then collected in the target system (e.g.
a real environment or a proxy simulator) using the same controller, and are subsequently used to recover a new posterior p(θ | S = S_r, A = A_r) and to update the control strategy. The prior p(θ) is then replaced by the new posterior, and the algorithm iterates until we achieve the desired controller performance. Details can be seen in Algorithm 1.

III. RESULTS

A. Classic Control Tasks

Online BayesSim has been evaluated on several control tasks, as shown in Table I. We compared the log-likelihood of the posteriors recovered by our algorithm against recent work in LFI. It can be seen that Online BayesSim outperforms current work on most of the tasks. This shows that online learning coupled with iterative updates can result in sharper posteriors.

B. Experiments on a Physical Robot

This section presents experimental results with a physical robot equipped with a skid-steering drive mechanism (Figure 1). We modelled the kinematics of the robot with a modified unicycle model, which accounts for skidding via an additional parameter [9]. The parameters to be estimated via Online BayesSim are the robot's wheel radius, the axial distance, i.e. the distance between the wheels, and the displacement of the robot's instant centre of rotation (ICR) from the robot's