Flow-based Autoregressive Structured Prediction of Human Motion

Mohsen Zand (m.zand@queensu.ca), Ali Etemad (ali.etemad@queensu.ca), Michael Greenspan (michael.greenspan@queensu.ca)
Department of Electrical and Computer Engineering, Ingenuity Labs, Queen's University, Kingston, ON, Canada

Abstract

A new method is proposed for human motion prediction by learning temporal and spatial dependencies in an end-to-end deep neural network. The joint connectivity is explicitly modeled using a novel autoregressive structured prediction representation based on flow-based generative models. We learn a latent space of complex body poses in consecutive frames which is conditioned on the high-dimensional structured input sequence. To construct each latent variable, the general and local smoothness of the joint positions are considered in a generative process using conditional normalizing flows. As a result, all frame-level and joint-level continuities in the sequence are preserved in the model. This enables us to parameterize the inter-frame and intra-frame relationships and joint connectivity for robust long-term as well as short-term predictions. Our experiments on two challenging benchmark datasets, Human3.6M and AMASS, demonstrate that our proposed method effectively models the sequence information for motion prediction and outperforms other techniques in 42 of the 48 total experiment scenarios, setting a new state-of-the-art.

1. Introduction

Automated prediction of human motion is a challenging task due to the inherently dynamic and stochastic nature, nonlinearity, high dimensionality, and complex context dependency of motion. Human motion prediction is an essential task in computer vision with many useful applications in autonomous driving, human-robot interaction, and healthcare [14, 35].
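To make the conditional normalizing-flow idea in the abstract concrete, the sketch below shows a single conditional affine coupling layer: an invertible transform of a pose vector whose scale and shift are predicted from the other half of the pose together with a conditioning context (e.g. an encoding of past frames). This is a minimal numpy illustration of the generic technique; the class name, the randomly initialized `mlp_params` networks, and all dimensions are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_params(in_dim, out_dim):
    # Random weights stand in for a trained conditioning network.
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)

class ConditionalAffineCoupling:
    """One coupling layer of a conditional normalizing flow.

    The first half of the pose vector, concatenated with the context,
    determines an affine transform of the second half, so the mapping
    is invertible by construction and its log-determinant is cheap.
    """

    def __init__(self, dim, cond_dim):
        self.half = dim // 2
        self.W_s, self.b_s = mlp_params(self.half + cond_dim, dim - self.half)
        self.W_t, self.b_t = mlp_params(self.half + cond_dim, dim - self.half)

    def _scale_shift(self, x1, cond):
        h = np.concatenate([x1, cond])
        log_s = np.tanh(h @ self.W_s + self.b_s)  # bounded log-scale for stability
        t = h @ self.W_t + self.b_t
        return log_s, t

    def forward(self, x, cond):
        x1, x2 = x[:self.half], x[self.half:]
        log_s, t = self._scale_shift(x1, cond)
        z = np.concatenate([x1, x2 * np.exp(log_s) + t])
        return z, log_s.sum()  # log|det J|, needed for the flow likelihood

    def inverse(self, z, cond):
        z1, z2 = z[:self.half], z[self.half:]
        log_s, t = self._scale_shift(z1, cond)
        return np.concatenate([z1, (z2 - t) * np.exp(-log_s)])
```

Stacking several such layers (alternating which half is transformed) yields an expressive yet exactly invertible map between poses and latent variables, with the conditioning context carrying the temporal dependence on previous frames.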
Current state-of-the-art methods mainly rely on Recurrent Neural Networks (RNNs), with the aim of modeling contextual information in the temporal dimension via motion-based dynamics [55, 2].

Figure 1: Motion prediction for a sequence from the AMASS dataset [33], with short-term and long-term predictions shown at timestamps from 2.0 to 3.0 sec. From top to bottom: the ground truth, the results of Zero-Velocity [35], Seq2seq [35], RNN-SPL [2], and our method (MotionFlow). The results show that our method maintains temporal and spatial smoothness in the target motions in both short-term and long-term prediction considerably better than the other methods.

The performance of RNNs, however, relies on the effectiveness of the extracted spatial skeleton features. RNNs also tend to overemphasize the temporal information in the data stream, which can lead to overfitting, especially when the training data is insuffi-

arXiv:2104.04391v1 [cs.CV] 9 Apr 2021