Published as a conference paper at ICLR 2021

∇Sim: DIFFERENTIABLE SIMULATION FOR SYSTEM IDENTIFICATION AND VISUOMOTOR CONTROL

https://gradsim.github.io

Krishna Murthy Jatavallabhula*1,3,4, Miles Macklin*2, Florian Golemo1,3, Vikram Voleti3,4, Linda Petrini3, Martin Weiss3,4, Breandan Considine3,5, Jérôme Parent-Lévesque3,5, Kevin Xie2,6,7, Kenny Erleben8, Liam Paull1,3,4, Florian Shkurti6,7, Derek Nowrouzezahrai3,5, and Sanja Fidler2,6,7

1Montreal Robotics and Embodied AI Lab, 2NVIDIA, 3Mila, 4Université de Montréal, 5McGill, 6University of Toronto, 7Vector Institute, 8University of Copenhagen

Figure 1: ∇Sim is a unified differentiable rendering and multiphysics framework that allows solving a range of control and parameter estimation tasks (rigid bodies, deformable solids, and cloth) directly from images/video.

ABSTRACT

We consider the problem of estimating an object's physical properties, such as mass, friction, and elasticity, directly from video sequences. Such a system identification problem is fundamentally ill-posed due to the loss of information during image formation. Current solutions require precise 3D labels, which are labor-intensive to gather and infeasible to create for many systems such as deformable solids or cloth. We present ∇Sim, a framework that overcomes the dependence on 3D supervision by leveraging differentiable multiphysics simulation and differentiable rendering to jointly model the evolution of scene dynamics and image formation. This novel combination enables backpropagation from pixels in a video sequence through to the underlying physical attributes that generated them. Moreover, our unified computation graph – spanning from the dynamics and through the rendering process – enables learning in challenging visuomotor control tasks, without relying on state-based (3D) supervision, while obtaining performance competitive with or better than techniques that rely on precise 3D labels.
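The core idea of gradient-based system identification can be illustrated with a toy 1D sketch (this is a hedged stand-in, not ∇Sim's implementation: ∇Sim backpropagates analytic gradients through a full multiphysics simulator and a differentiable renderer, while the example below fits a single free-fall parameter to observed positions using a finite-difference gradient):

```python
def simulate(g, y0=10.0, dt=0.05, steps=40):
    """Explicit-Euler free fall: dy/dt = v, dv/dt = -g. Returns the height trajectory."""
    y, v, traj = y0, 0.0, []
    for _ in range(steps):
        v -= g * dt
        y += v * dt
        traj.append(y)
    return traj

def loss(g, observed):
    # Squared error between simulated and observed trajectories
    # (in ∇Sim the comparison would be in pixel space, after rendering).
    return sum((s - o) ** 2 for s, o in zip(simulate(g), observed))

# "Observations" generated with the true parameter (stand-in for video frames).
true_g = 9.81
observed = simulate(true_g)

# Gradient descent on g, using a central finite difference in place of autodiff.
g, lr, eps = 5.0, 1e-3, 1e-4
for _ in range(200):
    grad = (loss(g + eps, observed) - loss(g - eps, observed)) / (2 * eps)
    g -= lr * grad

print(round(g, 2))  # → 9.81 (recovers the true parameter)
```

The loss is quadratic in `g` here, so gradient descent converges quickly; the point of a differentiable simulator and renderer is to make the same recipe work when the unknowns (mass, friction, elasticity) influence the observations only through a long, nonlinear chain of simulation and image-formation steps.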
1 INTRODUCTION

Accurately predicting the dynamics and physical characteristics of objects from image sequences is a long-standing challenge in computer vision. This end-to-end reasoning task requires a fundamental understanding of both the underlying scene dynamics and the imaging process. Imagine watching a short video of a basketball bouncing off the ground and asking: "Can we infer the mass and elasticity of the ball, predict its trajectory, and make informed decisions, e.g., how to pass and shoot?" These seemingly simple questions are extremely challenging to answer, even for modern computer vision models. The underlying physical attributes of objects and the system dynamics need to be modeled and estimated, all while accounting for the loss of information during 3D-to-2D image formation.

Depending on the assumptions made about the scene structure and dynamics, three types of solutions exist: black-, grey-, or white-box. Black-box methods (Watters et al., 2017; Xu et al., 2019b; Janner et al., 2019; Chang et al., 2016) model the state of a dynamical system (such as the basketball's trajectory in time) as a learned embedding of its states or observations. These methods require few prior assumptions about the system itself, but lack interpretability due to entangled variational factors (Chen et al., 2016) or due to the ambiguities in unsupervised learning (Greydanus et al., 2019; Cranmer et al., 2020b). Recently, grey-box methods (Mehta et al., 2020) have leveraged partial knowledge about the system dynamics to improve performance. In contrast, white-box methods (Degrave et al., 2016;

*Equal contribution

arXiv:2104.02646v1 [cs.CV] 6 Apr 2021