Neural Monocular 3D Human Motion Capture with Physical Awareness

SOSHI SHIMADA, Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
VLADISLAV GOLYANIK, Max Planck Institute for Informatics, Saarland Informatics Campus, Germany
WEIPENG XU, Facebook Reality Labs, USA
PATRICK PÉREZ, Valeo.ai, France
CHRISTIAN THEOBALT, Max Planck Institute for Informatics, Saarland Informatics Campus, Germany

Fig. 1. From an input monocular video, our method for markerless 3D human motion capture estimates global human poses which obey (bio-)physical constraints. In contrast to existing methods with physical awareness, our approach is neural and fully differentiable; it allows learning motion priors and the associated physical properties from the data. We can reconstruct more challenging and faster motions compared to the state of the art, with fewer artefacts such as jitter, foot-floor penetration and unnatural body postures. Thanks to these properties, our method can be used to directly drive a virtual character or visualise joint torques. (Left:) Results of our method on different sequences from the input and side views. (Right:) Applications in motion analysis by force visualisation and virtual character animation.

We present a new trainable system for physically plausible markerless 3D human motion capture, which achieves state-of-the-art results in a broad range of challenging scenarios. Unlike most neural methods for human motion capture, our approach, which we dub "physionical", is aware of physical and environmental constraints. It combines, in a fully-differentiable way, several key innovations: 1) a proportional-derivative controller, with gains predicted by a neural network, that reduces delays even in the presence of fast motions; 2) an explicit rigid body dynamics model; and 3) a novel optimisation layer that prevents physically implausible foot-floor penetration as a hard constraint.
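To make the first innovation concrete: a proportional-derivative (PD) controller drives the simulated body toward a target pose by producing torques proportional to the pose error and damped by the joint velocities. The sketch below is a generic PD control law, not the paper's exact formulation; in the paper the per-joint gains are predicted by a neural network rather than hand-tuned, so `kp` and `kd` here stand in for those learned values.

```python
import numpy as np

def pd_torques(q, q_dot, q_target, kp, kd):
    """Generic proportional-derivative control torques.

    q, q_dot : current joint angles and angular velocities
    q_target : desired joint angles (e.g. from a kinematic pose estimate)
    kp, kd   : per-joint proportional and derivative gains; in the paper
               these are network-predicted, which is what lets the
               controller keep up with fast motions (hypothetical values here)
    """
    return kp * (q_target - q) - kd * q_dot

# Toy example: a single joint at rest, one radian below its target.
tau = pd_torques(q=np.array([0.0]), q_dot=np.array([0.0]),
                 q_target=np.array([1.0]),
                 kp=np.array([10.0]), kd=np.array([1.0]))
# tau = [10.0]: the proportional term dominates since the joint is at rest.
```

Because this control law is a smooth function of its inputs, gradients flow through it to the gain-predicting network, which is what makes the overall pipeline fully differentiable.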
The inputs to our system are 2D joint keypoints, which are canonicalised in a novel way so as to reduce the dependency on intrinsic camera parameters, both at train and test time. This enables more accurate global translation estimation without generalisability loss. Our model can be finetuned with only 2D annotations when 3D annotations are not available. It produces smooth and physically-principled 3D motions at interactive frame rates in a wide variety of challenging scenes, including newly recorded ones. Its advantages are especially noticeable on in-the-wild sequences that significantly differ from common 3D pose estimation benchmarks such as Human3.6M and MPI-INF-3DHP. Qualitative results are provided in the supplementary video.

Authors' addresses: Soshi Shimada, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany, sshimada@mpi-inf.mpg.de; Vladislav Golyanik, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany, golyanik@mpi-inf.mpg.de; Weipeng Xu, Facebook Reality Labs, Pittsburgh, USA, xuweipeng@fb.com; Patrick Pérez, Valeo.ai, Paris, France, patrick.perez@valeo.com; Christian Theobalt, Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany, theobalt@mpi-inf.mpg.de.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s).
0730-0301/2021/8-ART83
https://doi.org/10.1145/3450626.3459825
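The keypoint canonicalisation mentioned in the abstract can be illustrated with a standard construction: mapping pixel-space keypoints into normalised camera coordinates by applying the inverse intrinsic matrix, which removes the dependence on focal length and principal point. This is a generic sketch under that assumption, not the paper's exact canonicalisation scheme.

```python
import numpy as np

def canonicalise_keypoints(kp_px, K):
    """Map pixel-space 2D joint keypoints to normalised camera coordinates.

    kp_px : (J, 2) array of keypoints in pixels
    K     : (3, 3) camera intrinsic matrix

    Multiplying homogeneous pixel coordinates by K^{-1} divides out the
    focal length and recentres on the principal point, so downstream
    networks see the same input regardless of the recording camera.
    (Illustrative construction; the paper's scheme may differ.)
    """
    J = kp_px.shape[0]
    homog = np.concatenate([kp_px, np.ones((J, 1))], axis=1)  # (J, 3)
    rays = homog @ np.linalg.inv(K).T                         # (J, 3)
    return rays[:, :2]

# Hypothetical intrinsics: focal length 1000 px, principal point (320, 240).
K = np.array([[1000.0,    0.0, 320.0],
              [   0.0, 1000.0, 240.0],
              [   0.0,    0.0,   1.0]])
pts = np.array([[320.0, 240.0]])  # the principal point maps to (0, 0)
print(canonicalise_keypoints(pts, K))
```

The practical payoff, as the abstract states, is that a model trained on such canonicalised inputs does not implicitly memorise the training cameras' intrinsics, which helps global translation estimation generalise.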
CCS Concepts: • Computing methodologies → Computer graphics; Motion capture.

Additional Key Words and Phrases: Monocular 3D Human Motion Capture, Physical Awareness, Global 3D, Physionical Approach

ACM Reference Format:
Soshi Shimada, Vladislav Golyanik, Weipeng Xu, Patrick Pérez, and Christian Theobalt. 2021. Neural Monocular 3D Human Motion Capture with Physical Awareness. ACM Trans. Graph. 40, 4, Article 83 (August 2021), 15 pages. https://doi.org/10.1145/3450626.3459825

1 INTRODUCTION

3D human motion capture is an actively researched area enabling many applications ranging from human activity recognition to sports analysis, virtual-character animation, film production, human-computer interaction and mixed reality. Since marker-based and