A Spatial-Temporal Attention Model for Human Trajectory Prediction

Xiaodong Zhao, Yaran Chen, Jin Guo, and Dongbin Zhao, Fellow, IEEE

Abstract—Human trajectory prediction is essential and promising in many related applications. It is challenging because of the uncertainty of human behavior, which is influenced not only by the person himself but also by the surrounding environment. Recent works based on long short-term memory (LSTM) models have brought tremendous improvements to the task of trajectory prediction. However, most of them focus on the spatial influence of humans and ignore the temporal influence. In this paper, we propose a novel spatial-temporal attention (ST-Attention) model, which studies spatial and temporal affinities jointly. Specifically, we introduce an attention mechanism to extract temporal affinity, learning the importance of historical trajectory information at different time instants. To explore spatial affinity, a deep neural network is employed to measure the different importance of the neighbors. Experimental results show that our method achieves competitive performance compared with state-of-the-art methods on publicly available datasets.

Index Terms—Attention mechanism, long short-term memory (LSTM), spatial-temporal model, trajectory prediction.

I. Introduction

Human trajectory prediction aims to predict a person's future path from the historical trajectory. The trajectory is represented by a set of consecutively sampled location coordinates. Trajectory prediction is a core building block for autonomous moving platforms, and prospective applications include autonomous driving [1]–[3], mobile robot navigation [4], assistive technologies [5], and smart video surveillance [6]. When a person walks in a crowd, the future path is determined by various factors such as the person's intention, social conventions, and the influence of nearby people. For instance, people prefer to walk along the sidewalk rather than cross the highway.
A person can adjust his path by estimating the future paths of the people around him, and those people do the same, which in turn affects the target. Such complex interactions make human trajectory prediction an extremely challenging problem.

Benefiting from powerful deep learning methods [7], [8], human trajectory prediction has improved significantly in the last few years. Yagi et al. [5] present a multi-stream convolution-deconvolution architecture for first-person videos, which verifies that pose, scale, and ego-motion cues are useful for future person localization. Pioneering works [9], [10] show that long short-term memory (LSTM) has the capacity to learn general human movements and predict future trajectories.

Although tremendous efforts have been made to address these challenges, there are still two limitations:

1) The historical trajectory information at different time instants has different levels of influence on the target human. Most works ignore this, although it plays an important role in predicting the future path. For the target human, the latest trajectory information usually has a higher level of influence on the future path, as shown in Fig. 1(a). For the neighbors, the trajectory information has a great impact whenever the neighbor is close to the target, as shown in Fig. 1(b). Thus, the historical trajectory information at different time instants ought to be given different weights. The attention mechanism is capable of learning different weights according to the importance.

Manuscript received March 24, 2020; accepted April 13, 2020. This work was supported by the National Key Research and Development Program of China (2018AAA0101005, 2018AAA0102404), the Program of the Huawei Technologies Co. Ltd.
(FA2018111061SOW12), the National Natural Science Foundation of China (61773054), and the Youth Research Fund of the State Key Laboratory of Complex Systems Management and Control (20190213). Recommended by Associate Editor Guangjie Han. (Corresponding authors: Yaran Chen and Jin Guo.)

Citation: X. D. Zhao, Y. R. Chen, J. Guo, and D. B. Zhao, “A spatial-temporal attention model for human trajectory prediction,” IEEE/CAA J. Autom. Sinica, vol. 7, no. 4, pp. 965–974, Jul. 2020.

X. D. Zhao is with the School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, and also with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: s20180612@xs.ustb.edu.cn).

Y. R. Chen and D. B. Zhao are with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (e-mail: chenyaran2013@ia.ac.cn; dongbin.zhao@ia.ac.cn).

J. Guo is with the School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, and also with the Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing 100083, China (e-mail: guojin@ustb.edu.cn).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JAS.2020.1003228

IEEE/CAA JOURNAL OF AUTOMATICA SINICA, VOL. 7, NO. 4, JULY 2020 965

Fig. 1. Illustration of the influences at different time instants. (a) As for the target human ($P_T$), the trajectory information at time $t-1$ and $t$ may affect the future path more than that at time $t-2$ and $t-3$. (b) As for the neighbor ($P_N$), he turns away from $P_T$ at time $t$. The trajectory information of $P_N$ at time $t-1$ has a greater influence on $P_T$, considering that $P_T$ is not allowed to occupy the position that $P_N$ has just left.
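The idea behind Fig. 1, that historical time instants should receive different weights, can be illustrated with a minimal sketch of softmax attention over a short history. This is not the ST-Attention model itself: the dot-product scoring, the function names `softmax` and `temporal_attention`, and the toy feature vectors are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(history, query):
    """Weight each historical time instant by its relevance to a query.

    history: list of feature vectors, oldest first (e.g. t-3, t-2, t-1, t).
    query:   feature vector of the same length.
    Dot-product scoring is an assumption here, not the paper's scoring function.
    """
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in history]
    weights = softmax(scores)
    dim = len(query)
    # The context vector is the attention-weighted sum of the history.
    context = [sum(w * h[k] for w, h in zip(weights, history)) for k in range(dim)]
    return weights, context


# Toy features (hypothetical values) for time instants t-3, t-2, t-1, t.
history = [[0.1, 0.0], [0.5, 0.0], [1.0, 0.0], [2.0, 0.0]]
query = [1.0, 0.0]
weights, context = temporal_attention(history, query)
print(weights)  # one weight per time instant
print(context)  # weighted summary of the history
```

In this toy setting the later time instants score higher against the query, so they receive larger weights, mirroring the situation in Fig. 1(a). In a trained model the scores would come from learned parameters rather than a fixed dot product.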