Tracking a Hand Manipulating an Object

Henning Hamer 1   Konrad Schindler 2   Esther Koller-Meier 1   Luc Van Gool 1,3
1 Computer Vision Laboratory, ETH Zurich   2 Computer Science Department, TU Darmstadt   3 ESAT-PSI/VISICS, KU Leuven
{hhamer,ebmeier,vangool}@vision.ee.ethz.ch   schindler@cs.tu-darmstadt.de   luc.vangool@esat.kuleuven.be

Abstract

We present a method for tracking a hand while it is interacting with an object. This setting is arguably the one in which hand tracking has the most practical relevance, but it poses significant additional challenges: strong occlusions by the object as well as self-occlusions are the norm, and classical anatomical constraints need to be softened due to the external forces between hand and object. To achieve robustness to partial occlusions, we use an individual local tracker for each segment of the articulated structure. The segments are connected in a pairwise Markov random field, which enforces the anatomical hand structure through soft constraints on the joints between adjacent segments. The most likely hand configuration is found with belief propagation. Both range and color data are used as input. Experiments are presented for synthetic data with ground truth and for real data of people manipulating objects.

1. Introduction

Visual hand tracking has several important applications, including intuitive human-computer interaction, human behavior and emotion analysis, safety and process-integrity control on the work floor, rehabilitation, and motion capture. Not surprisingly, much research has already gone into computer algorithms for hand tracking. Yet the majority of contributions have considered only free hands, whereas in many applications the hands will actually be manipulating objects. In this paper, we present for the first time a system that can track the articulated 3D pose of a hand while the hand interacts with an object (such as depicted in Fig. 1).
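The inference scheme named in the abstract, local per-segment evidence combined in a pairwise Markov random field and solved with belief propagation, can be illustrated with a toy sketch. The discrete states, unary scores, and compatibility function below are invented for illustration only; the actual system operates on continuous segment poses, not three integer labels.

```python
# Toy max-product belief propagation on a chain-structured pairwise MRF,
# mimicking one finger: three segments, each with a small discrete set of
# candidate poses (integer labels standing in for full segment poses).
# All numbers below are hypothetical.
STATES = [0, 1, 2]          # candidate poses per segment

unary = [                   # local tracker score for each segment/state
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
]

def pairwise(s, t):
    # Soft anatomical joint constraint: prefer compatible neighboring
    # poses, but never forbid a combination outright (constraints are
    # softened rather than hard).
    return 1.0 if s == t else 0.4 if abs(s - t) == 1 else 0.1

def map_chain(unary):
    """Viterbi-style max-product message passing along the chain;
    returns the most likely joint labeling (MAP configuration)."""
    n = len(unary)
    msg = [dict() for _ in range(n)]    # msg[i][s]: best score ending at state s
    back = [dict() for _ in range(n)]   # backpointers for decoding
    for s in STATES:
        msg[0][s] = unary[0][s]
    for i in range(1, n):
        for s in STATES:
            best_t = max(STATES, key=lambda t: msg[i - 1][t] * pairwise(t, s))
            msg[i][s] = unary[i][s] * msg[i - 1][best_t] * pairwise(best_t, s)
            back[i][s] = best_t
    # Backtrack from the best final state.
    labels = [max(STATES, key=lambda s: msg[n - 1][s])]
    for i in range(n - 1, 0, -1):
        labels.append(back[i][labels[-1]])
    return labels[::-1]

print(map_chain(unary))  # prints [0, 1, 1]
```

On a tree-structured graph such as a hand skeleton, the same max-product message passing yields the exact MAP configuration; the sketch uses a chain purely for brevity.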
The presence of objects has a significant impact on the complexity and generality of the task. First, the manipulated objects will frequently occlude parts of the hand, and hand poses occurring during the process of grabbing or holding will aggravate the problem of self-occlusion (e.g. in Fig. 1 large parts of four fingers are partially or even fully occluded). Second, the hand structure itself is less constrained in the presence of objects: parameter ranges have to be widened and some simplifying assumptions derived from human anatomy no longer hold. When in contact with an object, forces are exerted on the hand, resulting in poses which cannot be achieved with the bare hand (e.g. bending fingers backwards when pressing against a rigid surface, breaching the “2/3-rule” between the joints of a finger when pushing a button, etc.). Tracking hands under these less favorable conditions is the topic of this paper. To the best of our knowledge, visual hand tracking in the presence of objects is uncharted terrain.

Figure 1. The goal of the present work: recovering the articulated 3D structure of the hand during object manipulation.

Object manipulation is an inherently 3-dimensional phenomenon, whereas 3D pose estimation in monocular video is seriously under-constrained. We therefore base our estimations not only on image color, but also on 2.5D depth maps. In our case, the depth maps are obtained with a real-time structured-light system [19], but in the near future such data will in all likelihood be available at negligible cost, due to the rapid progress of time-of-flight sensors [14, 3].

Our approach has been inspired by an established trend in object recognition and detection. Occlusion is a frequent and not reliably solved problem in these applications. Models are split into local parts, and each part separately contributes evidence about the complete model.
In this way robustness to partial occlusion is achieved and the estimation relies only on observable parts, e.g. [11, 10]. The underlying global configuration can then be used to infer information regarding the occluded parts. In much the same way,