Published in 2011 IEEE Int. Conf. on Computer Vision Workshops, 906-913, Barcelona/Spain. Final version © by IEEE, see IEEE Xplore.

On the Effect of Temporal Information on Monocular 3D Human Pose Estimation

Jürgen Brauer ⋆, Wenjuan Gong ⋄, Jordi Gonzàlez ⋄, Michael Arens ⋆
⋆ Fraunhofer IOSB, Ettlingen, Germany
⋄ Computer Vision Center, Universitat Autònoma de Barcelona, Spain
⋆ {juergen.brauer,michael.arens}@iosb.fraunhofer.de, ⋄ {wenjuan,poal}@cvc.uab.es

Abstract

We address the task of estimating 3D human poses from monocular camera sequences. Many works use multiple consecutive frames to estimate the 3D pose in a single frame. Such an approach should ease the pose estimation task substantially, since multiple consecutive frames make it possible, in principle, to resolve 2D projection ambiguities. However, it has not yet been investigated systematically how much the 3D pose estimates improve when multiple consecutive frames are used as opposed to single-frame information.

In this paper we analyze how the quality of 3D pose estimates varies with the number of consecutive frames for which 2D pose estimates are available. We validate the use of temporal information on two major, different classes of approaches for human pose estimation: modeling approaches and learning approaches. The results of our experiments show that both learning and modeling approaches benefit from using multiple frames as opposed to single-frame input, but that the benefit is small when the 2D pose estimates are of high quality in terms of precision.

1. Introduction

Estimating the 3D articulation of humans in videos is an important topic in computer vision, since knowledge about the articulation of persons opens the door to behavior analysis based on such 3D pose estimates.
If only monocular camera information is available, the task of identifying the 3D pose of persons in videos showing a huge variety of actions, lighting conditions, person occlusions, and cluttered backgrounds can be considered as yet unsolved. To ease this problem, using temporal information to estimate the 3D pose in a frame is an obvious idea, since it should make it possible to resolve ambiguities.

A seminal work that motivated the idea of using several consecutive frames for pose estimation was the work by Johansson [5] with so-called Moving Light Displays (MLD), which are a small number of light sources attached to the body of a person in a dark scene. Johansson showed that a small number of points on the human body is sufficient to recognize and discriminate human poses and motions correctly if sequences of these points are presented. Rashid [7] showed that such sequences of 2D points make it possible to identify the body parts of two walking persons even when the two 2D point sets overlap for a short time.

One popular way to incorporate temporal information is to improve the quality of pose estimation with motion models. For example, Urtasun et al. [14] use learned motion models for human pose tracking. The temporal correlations between consecutive frames are incorporated into the motion models, and human pose estimation is then constrained by these learned motion models. Instead of learning motion models, the methods presented here incorporate temporal information by using several consecutive frames as input. In this way, we avoid the learning phase for motion models.

There is a huge variety in how multiple consecutive frames are used for human pose estimation. Singh and Nevatia [11], for example, track individual body parts over multiple frames using a particle filtering approach that incorporates kinematic constraints.
Andriluka et al. [1] first identify complete 2D poses for single frames and then use sequences of 2D poses over multiple frames ('2D tracklets') as input for a 3D pose estimator. Daubney et al. [3], for example, use a sparse cloud of features extracted with the Kanade-Lucas-Tomasi (KLT) feature tracker as observational data. Using the motion of such features over multiple frames, low-level part detectors are learnt directly from motion capture data.

Interestingly, it has not yet been investigated systematically how the number of consecutive frames influences 3D human pose estimation results. Intuitively, one would expect that using more frames is better, but it is unclear how much is gained in the quality of 3D pose estimates compared to single-frame 3D pose estimation. Further, we do not know whether there is some input window size at which the 3D pose estimation performance gain saturates and we simultaneously waste more and more