IEEE TRANSACTIONS ON CYBERNETICS

Fast Online Video Pose Estimation by Dynamic Bayesian Modeling of Mode Transitions

Ming-Ching Chang, Lipeng Ke, Honggang Qi, Longyin Wen, Siwei Lyu

Abstract—We propose a fast online video pose estimation method that detects and tracks human upper-body poses based on conditional dynamic Bayesian modeling of pose modes, without referring to future frames. Estimating human body poses from video is an important task with many applications. Our method extends fast image-based pose estimation to live video streams by leveraging the temporal correlation of articulated poses between frames. Video poses are inferred over a time window using a conditional dynamic Bayesian network (CDBN), which we term T-CDBN. Specifically, latent pose modes and their transitions are modeled and co-determined by combining three modules: (1) inference based on current observations, (2) modeling of mode-to-mode transitions as a probabilistic prior, and (3) modeling of state-to-mode transitions using multi-mode softmax regression. Given the predicted pose modes, the body pose, in terms of arm joint locations, can then be determined more accurately and robustly. Our method is well suited to high frame rate (HFR) scenarios, where pose mode transitions effectively capture action-related priors to boost performance. We evaluate our method on a newly collected HFR-Pose dataset and four major video pose datasets (VideoPose2, TUM Kitchen, FLIC, and Penn Action). Our method achieves improvements in both accuracy and efficiency over existing online video pose estimation methods.

I. INTRODUCTION

As the basis for understanding human actions and behaviors from visual imagery, upper-body pose estimation from video has many applications, including gesture recognition, human-computer interaction, gaming, sign language recognition, and the study of affective and social behaviors.
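To make the three-module combination in the abstract concrete, the following is a minimal, illustrative sketch of one online filtering step over K latent pose modes. It assumes the three signals are fused multiplicatively and renormalized, as in standard forward filtering for a dynamic Bayesian network; the function and variable names (`transition_prior`, `W`, `b`, etc.) are hypothetical placeholders and not taken from the paper's implementation.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(v - v.max())
    return e / e.sum()

def filter_mode(obs_likelihood, prev_belief, transition_prior, W, b, prev_state):
    """One illustrative filtering step combining:
       (1) the likelihood of the current observation under each mode,
       (2) a K-by-K mode-to-mode transition prior applied to the previous belief,
       (3) a state-to-mode prediction via multi-mode (multinomial) softmax
           regression on the previous pose state.
       Returns a normalized posterior belief over the K modes."""
    pred = transition_prior.T @ prev_belief      # (2) temporal prior, shape (K,)
    state_pred = softmax(W @ prev_state + b)     # (3) softmax regression, shape (K,)
    post = obs_likelihood * pred * state_pred    # (1) fuse with observation term
    return post / post.sum()
```

The predicted mode posterior would then condition the per-frame estimation of arm joint locations; since every factor here depends only on the current and previous frames, the step runs online without future frames.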
With the ubiquity of inexpensive video cameras on mobile devices, it has become increasingly easy to capture live video feeds from which upper-body poses or gestures can be estimated as a continuous time series for further processing. We therefore focus on electro-optical (RGB) videos rather than RGB+D videos (e.g., those obtained with a Microsoft Kinect, which relies on a depth sensor), for generality and applicability in real-world use.

Ming-Ching Chang is with the Computer Engineering Department, University at Albany, SUNY, NY, USA. Lipeng Ke and Honggang Qi are with the School of Computer and Control Engineering, University of the Chinese Academy of Sciences, Beijing, China. Longyin Wen is with the GE Global Research Center, NY, USA. Siwei Lyu is with the Computer Science Department, University at Albany, SUNY, NY, USA. Siwei Lyu is the corresponding author.

Fig. 1. Method overview. (a) The proposed T-CDBN model structure for online video pose estimation and (b) an example of T-CDBN-MODEC, which applies the MODEC single-image pose estimator [1] to online video pose estimation. Here xt denotes the observations (i.e., image features) at time t, yt the body pose (i.e., joint locations), and zt the latent pose mode in individual frames. See Section III-A for explanations.

With the maturity of efficient image-based pose estimation methods [2], [3], [1], [4], [5], [6], [7], a naive solution to video pose estimation is to apply an image-based method to each frame of a video as if the frames were independent images. However, this approach works only to a certain extent. The main drawback is that the strong temporal correlations of articulated poses between video frames are discarded. Intuitively, human activities usually involve smooth and continuous hand movements, so the continuity of poses across consecutive video frames provides strong cues for robust pose estimation through tracking and prediction.
Treating individual frames without considering temporal correlations leads to inefficient algorithms and inaccurate estimates, owing to ambiguities and occlusions within a single frame. In contrast, estimating upper-body poses as a continuous temporal sequence allows better handling of occlusions and more robust estimation. Methods for pose estimation from videos (particularly those focusing on upper-body poses) have advanced significantly in recent years, e.g., [8], [9], [10], [11], [12], [13], [14]. However, the majority of existing methods are offline in nature, i.e., upper-body poses in the current frame are inferred using both past and future frames. The accuracy of these methods usually comes at the price of complicated inference procedures, which significantly reduce running efficiency. Thus these methods do not address the practical need for fast video pose estimation.