A Video Motion Capture System for Interactive Games

Ryuzo Okada, Corporate R&D Center, Toshiba Corporation, ryuzo.okada@toshiba.co.jp
Nobuhiro Kondoh, Semiconductor Company, Toshiba Corporation, nobuhiro.kondoh@toshiba.co.jp
Björn Stenger, Toshiba Research Europe Ltd, Computer Vision Group, bjorn.stenger@crl.toshiba.co.uk

Abstract

This paper presents a method for markerless human motion capture using a single camera. It uses tree-based filtering to efficiently propagate a probability distribution over poses of a 3D body model. The pose vectors and associated shapes are arranged in a tree, which is constructed by hierarchical pairwise clustering, in order to efficiently evaluate the likelihood in each frame. A new likelihood function is proposed that improves the pose estimation of thinner body parts, i.e. the limbs. The dynamic model takes self-occlusion into account by increasing the variance of occluded body parts, thus allowing for recovery when the body part reappears. An online motion capture system was implemented on two platforms: a standard PC and a system using the Cell Broadband Engine™ [8]. As an application we present a computer game in which an avatar is controlled by the player's body motion.

1. Introduction

Human pose estimation from image sequences has various applications in areas such as human-computer interfaces, computer games, and avatar animation, and is an area of active research [1, 2, 4, 5, 7, 9, 11, 12, 13, 16, 18]. Some applications, such as gesture interfaces for gaming, require real-time capability, so an efficient search for the optimal pose is important. Real-time motion capture has been achieved using incremental tracking; however, in this case the problem of initial pose estimation needs to be solved, and estimation errors can accumulate over long image sequences [11, 18].
Detecting body parts [4, 13, 17] can reduce the computational cost and does not require a manual initial pose estimate, but finding body parts in a single view is particularly difficult because of self-occlusion. Efficient versions of particle filtering have been used with success in the past, but they have the drawback of requiring pose initialization at the start and whenever tracking failure occurs [5].

Recently, learning-based methods have received more attention, in which a mapping from observation to body pose is learned from a large set of training examples [1, 10, 14]. However, these methods do not adapt the final model estimate to an individual subject.

In this paper we present a system for real-time pose estimation using a single camera without markers. Our method is based on tree-based filtering, where the current pose is estimated by hierarchically evaluating observation likelihoods from image silhouettes while taking temporal consistency of the poses into account [15].

This paper introduces several innovations that improve robustness and efficiency: (1) A 3D body model, selected from a discrete set according to the user's body size, is used to generate silhouettes for more accurate matching. (2) To further increase computational efficiency, we evaluate the silhouette distance on an image pyramid, using different image resolutions for different tree levels. (3) The dynamic model explicitly takes self-occlusion into account by increasing the variance of the joint parameters of occluded body parts. This relaxes the temporal constraint on such parts in order to resume tracking them when they reappear. (4) The cost function for silhouette matching is based on weighted distance functions with equal weight on the 'shape skeleton' of the silhouette. Using this normalized weight improves the estimation of thinner body parts such as arms and legs.
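The occlusion-aware dynamic model of innovation (3) can be sketched as follows: joints flagged as occluded are given an inflated transition noise, which relaxes the temporal constraint so that tracking can resume when the part reappears. This is a minimal illustrative sketch; the function name, the Gaussian noise model, and the specific values of `base_sigma` and `occlusion_factor` are assumptions, not taken from the paper.

```python
import numpy as np

def predict_pose(pose, occluded, base_sigma=0.05, occlusion_factor=4.0):
    """Sample a predicted pose under a simple Gaussian dynamic model.

    pose:     current joint-angle vector, shape (N,)
    occluded: boolean mask, True for joints of occluded body parts

    Occluded joints receive an inflated standard deviation, spreading
    the prediction more widely so the filter can re-acquire the part
    when it becomes visible again. (Illustrative parameter values.)
    """
    sigma = np.where(occluded, base_sigma * occlusion_factor, base_sigma)
    return pose + np.random.normal(0.0, sigma)

# Example: a 4-joint pose in which joint 2 belongs to an occluded part.
pose = np.array([0.1, -0.3, 0.7, 0.0])
occluded = np.array([False, False, True, False])
predicted = predict_pose(pose, occluded)
```

Over many sampled predictions, the occluded joint's values spread roughly four times as widely as those of the visible joints, which is exactly the relaxation of the temporal constraint described above.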
2. Tree-based filtering framework

Tracking of pose is formulated in a probabilistic framework as follows. Given the observations up to time t, z_{1:t}, the aim is to estimate the posterior distribution of the state x_t, which consists of joint angles and 3D position. With the Markov assumption that the observation at time t is independent of all past observations given x_t, the posterior is updated using Bayes' rule when the observation at time t is obtained:

p(x_t \mid z_{1:t}) = c_t \, p(z_t \mid x_t) \, p(x_t \mid z_{1:t-1}),   (1)

where c_t is a normalization constant, and p(z_t \mid x_t) and p(x_t \mid z_{1:t-1}) are the likelihood and prior distribution, respectively. The prior is computed as

p(x_t \mid z_{1:t-1}) = \sum_{x_{t-1}} p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid z_{1:t-1}),   (2)

where p(x_t \mid x_{t-1}) is the probability distribution for state transitions. The state posterior distribution is estimated in each time step by repeated application of prediction (Eq. 2) and update (Eq. 1). For computational efficiency, a tree structure is used to compute discrete approximations to these distributions.

2.1. Hierarchy of silhouette shapes

Using a marker-based motion capture system, pose data from three subjects is collected. The pose data is used to
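Over a discrete set of pose hypotheses, one prediction/update cycle of Eqs. (1) and (2) reduces to a matrix-vector product followed by a pointwise multiplication and normalization. The sketch below shows this generic discrete Bayes filter step; the tree-structured evaluation that makes it efficient in practice is omitted, and the toy numbers are purely illustrative.

```python
import numpy as np

def bayes_filter_step(posterior_prev, transition, likelihood):
    """One prediction/update step of a discrete Bayes filter.

    posterior_prev: p(x_{t-1} | z_{1:t-1}), shape (N,)
    transition:     transition[i, j] = p(x_t = i | x_{t-1} = j), shape (N, N)
    likelihood:     p(z_t | x_t), shape (N,)
    Returns p(x_t | z_{1:t}), shape (N,).
    """
    prediction = transition @ posterior_prev      # prior, Eq. (2)
    unnormalized = likelihood * prediction        # numerator of Eq. (1)
    return unnormalized / unnormalized.sum()      # c_t is the normalizer

# Toy example with three discrete pose hypotheses.
posterior = np.array([0.5, 0.3, 0.2])
transition = np.array([[0.8, 0.1, 0.1],
                       [0.1, 0.8, 0.1],
                       [0.1, 0.1, 0.8]])
likelihood = np.array([0.2, 0.7, 0.1])
posterior = bayes_filter_step(posterior, transition, likelihood)
```

Here the observation likelihood favors the second hypothesis strongly enough to overturn the prior, so the updated posterior concentrates on it; iterating this step per frame is exactly the prediction/update loop described above.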