NRST: Non-rigid Surface Tracking from Monocular Video

Marc Habermann 1[0000−0003−3899−7515], Weipeng Xu 1[0000−0001−9548−5108], Helge Rhodin 2[0000−0003−2692−0801], Michael Zollhöfer 3[0000−0003−1219−0625], Gerard Pons-Moll 1[0000−0001−5115−7794], and Christian Theobalt 1[0000−0001−6104−6625]

1 Max Planck Institute for Informatics, Saarbruecken 66123, Germany
https://www.mpi-inf.mpg.de/home/
{mhaberma, wxu, gpons, theobalt}@mpi-inf.mpg.de
2 EPFL, Lausanne CH-1015, Switzerland
https://www.epfl.ch/
helge.rhodin@epfl.ch
3 Stanford University, Stanford CA 94305, USA
https://www.stanford.edu/
zollhoefer@cs.stanford.edu

Abstract. We propose an efficient method for non-rigid surface tracking from monocular RGB videos. Given a video and a template mesh, our algorithm sequentially registers the template non-rigidly to each frame. We formulate the per-frame registration as an optimization problem that includes a novel texture term specifically tailored towards tracking objects with uniform texture but fine-scale structure, such as the regular micro-structural patterns of fabric. Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics. This enables us to accurately track uniformly colored materials that have these high-frequency micro-structures, for which traditional photometric terms are usually less effective. The results demonstrate the effectiveness of our method on both general textured non-rigid objects and monochromatic fabrics.

1 Introduction

In this paper, we propose NRST, an efficient method for non-rigid surface tracking from monocular RGB videos. Capturing the non-rigid deformation of a dynamic surface is an important and long-standing problem in computer vision. It has a wide range of real-world applications in fields such as virtual/augmented reality, medicine, and visual effects. Most existing methods are based on multi-view imagery and require expensive and complicated system setups [3,25,23]. There also exist methods that rely on only a single depth or RGB-D camera [42,44,19,18]. However, these sensors are not as ubiquitous as RGB cameras, and these methods cannot be applied to the vast amount of existing video footage.
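The abstract describes the texture term only at a high level: it exploits per-pixel orientation cues of fine micro-structures such as yarn patterns. As a purely illustrative sketch of how such orientation cues can be extracted, and not the formulation used in this paper, the following Python code estimates the dominant local orientation with the standard structure tensor; the helper name local_orientation and all parameter choices are hypothetical.

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def local_orientation(gray, sigma=2.0):
    """Estimate per-pixel orientation of fine image structures
    (e.g. yarn patterns) from the 2x2 structure tensor.

    gray: 2-D float array (grayscale image).
    Returns the dominant gradient angle in [0, pi) (oriented
    structures run perpendicular to it) and a coherence value
    in [0, 1] measuring how strongly oriented each pixel is.
    """
    # Image gradients along x (axis=1) and y (axis=0).
    gx = sobel(gray, axis=1)
    gy = sobel(gray, axis=0)
    # Locally smoothed structure-tensor entries.
    jxx = gaussian_filter(gx * gx, sigma)
    jxy = gaussian_filter(gx * gy, sigma)
    jyy = gaussian_filter(gy * gy, sigma)
    # Angle of the eigenvector with the largest eigenvalue, mod pi.
    angle = (0.5 * np.arctan2(2.0 * jxy, jxx - jyy)) % np.pi
    # Coherence (l1 - l2) / (l1 + l2) of the tensor eigenvalues.
    num = np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2)
    den = np.maximum(jxx + jyy, 1e-8)
    coherence = num / den
    return angle, coherence

In such a sketch, a tracking energy could compare orientation maps of the rendered template and the input frame instead of raw colors, which stays informative even when the material is uniformly colored.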