EUROGRAPHICS 2008 / G. Drettakis and R. Scopigno (Guest Editors)
Volume 27 (2008), Number 2

© 2007 The Author(s)
Journal compilation © 2007 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

Articulated Object Reconstruction and Markerless Motion Capture from Depth Video

Yuri Pekelny and Craig Gotsman
Center for Graphics and Geometric Computing, Technion, Israel

From depth images to skins to full skeletal 3D models

Abstract

We present an algorithm for acquiring the 3D surface geometry and motion of a dynamic piecewise-rigid object using a single depth video camera. The algorithm identifies and tracks the rigid components in each frame, while accumulating the geometric information acquired over time, possibly from different viewpoints. The algorithm also reconstructs the dynamic skeleton of the object, and can thus be used for markerless motion capture. The acquired model can then be animated to novel poses. We show the results of the algorithm applied to synthetic and real depth video.

Categories and Subject Descriptors (according to ACM CCS): I.3.5 [Computer Graphics]: Computational Geometry and Object Modeling; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism

1. Introduction

Traditional 3D scanning applications involve only static subjects, and the main challenge in these applications is to produce an accurate digital model of the scene geometry. Over the past decade, a multitude of algorithms have been proposed to address this problem, and by now it may be considered (almost) solved. Thus attention is shifting to dealing with dynamic scenes, i.e. ones in which the subjects are moving.

Since the scene is dynamic, at first glance it may seem that the problem is not well-defined. What does scanning a scene in which the geometry is constantly changing mean? What do we expect as the output of this process?
The problem is compounded by the fact that in order to capture any motion accurately, we must sense the scene at real-time rates, a technological challenge for the scanning device in its own right.

To address the last challenge first, it seems that the most suitable sensor to use for dynamic scenes is the so-called depth video camera. Such a camera provides an image of the scene, where each pixel contains not only traditional intensity information, but also the geometric distance from the camera to the subject at that pixel. A number of commercial cameras generating this information at video rates have appeared over recent years [CVCM, 3DV, PS, VZS], and the state-of-the-art of the technologies involved is improving rapidly. Prices are also dropping, so we expect that depth video cameras will be available at reasonable cost within the next few years.

The simplest version of the dynamic scene scanning problem is motion capture of a piecewise-rigid 3D subject (such as a person). This means that as output we are not interested in the precise geometry of the subject, rather in the rough motion of a “skeleton” representing its rigid parts, of which there are usually just a few. Motion capture is performed today using elaborate rigs involving markers placed on the subject, and it would be useful to have a device capable of markerless motion capture based only on depth cameras. This is the objective of a number of commercial companies [3DV, PS] who are developing depth cameras for use as motion capture and gesture recognition devices in interactive consumer-level gaming applications.

A more challenging version of the problem is full 3D scanning of dynamic piecewise-rigid 3D objects. The