EUROGRAPHICS 2008 / G. Drettakis and R. Scopigno (Guest Editors)
Volume 27 (2008), Number 2
© 2007 The Author(s)
Journal compilation © 2007 The Eurographics Association and Blackwell Publishing Ltd.
Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and
350 Main Street, Malden, MA 02148, USA.
Articulated Object Reconstruction and Markerless Motion
Capture from Depth Video
Yuri Pekelny and Craig Gotsman
Center for Graphics and Geometric Computing
Technion, Israel
From depth images to skins to full skeletal 3D models
Abstract
We present an algorithm for acquiring the 3D surface geometry and motion of a dynamic piecewise-rigid object
using a single depth video camera. The algorithm identifies and tracks the rigid components in each frame, while
accumulating the geometric information acquired over time, possibly from different viewpoints. The algorithm also
reconstructs the dynamic skeleton of the object, and thus can be used for markerless motion capture. The acquired
model can then be animated to novel poses. We show the results of the algorithm applied to synthetic and real
depth video.
Categories and Subject Descriptors (according to ACM CCS): I.3.5 [Computer Graphics]: Computational Geometry
and Object Modeling; I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism
1. Introduction
Traditional 3D scanning applications involve only static
subjects, and the main challenge in these applications is
to produce an accurate digital model of the scene
geometry. Over the past decade, a multitude of algorithms
have been proposed to address this problem, and by now it
may be considered (almost) solved. Attention is therefore
shifting to dynamic scenes, i.e., ones in which the
subjects are moving.
Since the scene is dynamic, at first glance it may seem
that the problem is not well-defined. What does it mean
to scan a scene whose geometry is constantly changing?
What do we expect as the output of this process?
The problem is compounded by the fact that in order to
capture any motion accurately, we must sense the scene
at real-time rates, a technological challenge for the
scanning device in its own right.
To address the last challenge first, it seems that the most
suitable sensor to use for dynamic scenes is the so-called
depth video camera. Such a camera provides an image of
the scene, where each pixel contains not only traditional
intensity information, but also the geometric distance
from the camera to the subject at that pixel. A number of
commercial cameras generating this information at video
rates have appeared over recent years [CVCM, 3DV, PS,
VZS], and the state of the art of the technologies
involved is improving rapidly. Prices are also dropping, so
we expect that depth video cameras will be available at
reasonable cost within the next few years.
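To make the notion of a depth image concrete, the following minimal sketch converts one depth frame into a set of 3D points in the camera coordinate frame. It is an illustration under stated assumptions, not the processing used by any particular camera: it assumes an ideal pinhole camera whose intrinsic parameters (the hypothetical names fx, fy, cx, cy) are known, and it assumes that each pixel stores the distance measured along the viewing ray through that pixel. Cameras that report depth along the optical axis instead would need a slightly different formula.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image into 3D points in the camera frame.

    depth  -- list of rows; each entry is the distance along the viewing
              ray (an assumption), or None / non-positive if no reading
    fx, fy -- focal lengths in pixels (hypothetical pinhole intrinsics)
    cx, cy -- principal point in pixels (hypothetical pinhole intrinsics)
    """
    points = []
    for v, row in enumerate(depth):
        for u, r in enumerate(row):
            if r is None or r <= 0:
                continue  # skip pixels without a valid measurement
            # Direction of the viewing ray through pixel (u, v).
            dx, dy, dz = (u - cx) / fx, (v - cy) / fy, 1.0
            norm = math.sqrt(dx * dx + dy * dy + dz * dz)
            # Walk the measured distance r along the unit ray direction.
            points.append((r * dx / norm, r * dy / norm, r * dz / norm))
    return points
```

Each valid pixel yields one point whose distance from the camera origin equals the measured value, so a single frame is a partial point cloud of the surfaces visible from that viewpoint.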
The simplest version of the dynamic scene scanning
problem is motion capture of a piecewise-rigid 3D
subject (such as a person). This means that as output we
are not interested in the precise geometry of the subject,
but rather in the rough motion of a “skeleton” representing
its rigid parts, of which there are usually just a few.
Motion capture is performed today using elaborate rigs
involving markers placed on the subject, and it would be
useful to have a device capable of markerless motion
capture based only on depth cameras. This is the
objective of a number of commercial companies [3DV, PS]
who are developing depth cameras for use as motion
capture and gesture recognition devices in interactive
consumer-level gaming applications.
A more challenging version of the problem is full 3D
scanning of dynamic piecewise-rigid 3D objects. The