Pattern Recognition 110 (2021) 107631
Journal homepage: www.elsevier.com/locate/patcog

Sparse motion fields for trajectory prediction

Catarina Barata a,∗, Jacinto C. Nascimento a, João M. Lemos b, Jorge S. Marques a

a Institute for Systems and Robotics, Instituto Superior Técnico, Universidade de Lisboa, Portugal
b INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal

Article info
Article history: Received 22 January 2020; Revised 9 August 2020; Accepted 6 September 2020; Available online 10 September
Keywords: Human motion analysis; Trajectory prediction; Sparse motion fields

Abstract
Trajectory prediction is a crucial element of many automated tasks, such as autonomous navigation or video surveillance. To automatically predict the motion of an agent (e.g., pedestrian or car), the model needs to efficiently represent human motion and “understand” the external stimuli that may influence human behavior. In this work we propose a methodology to model the motion of agents in a video scene. Our method is based on space-varying sparse motion fields, which simultaneously characterize diverse motion patterns in the scene and implicitly learn contextual cues about the static environment, namely obstacles and semantic constraints. The sparse motion fields are applied to the task of long-term trajectory prediction using a probabilistic generative approach. Several benchmark data sets are used to demonstrate the potential of the proposed approach and show that our method achieves performance competitive with the state of the art.

© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

1.1. Motivation

The ability to describe and interpret the behavior of the various agents in a scene is a key factor in understanding that scene.
This is a requirement in areas such as video surveillance, sports analysis, and robot or autonomous car navigation, where the provided information may be used to address several tasks (e.g., tracking, activity recognition, and detection of abnormal behaviors) [1,2]. All of the aforementioned tasks rely on assessing the motion performed by the agent. Trajectory data, i.e., the set of consecutive 2D positions of an agent, are known to provide relevant cues for understanding human motion behavior. Thus, they have been adopted by several works in the literature, in particular those devoted to short- and long-term path prediction.

∗ Corresponding author. E-mail address: ana.c.fidalgo.barata@tecnico.ulisboa.pt (C. Barata).

Human motion is governed by a variety of factors, namely agent-specific cues (e.g., intended destination or preferred velocity) and environment characteristics [4]. The latter can be divided into: i) the dynamic environment, which accounts for the interactions with other agents (e.g., neighboring pedestrians or cars) [5–7]; and ii) the static environment, which characterizes the physical constraints of the scene, i.e., its semantics (e.g., buildings, roads, and sidewalks) and/or individual obstacles [8,9]. The majority of recent approaches place significant emphasis on the characterization of the dynamic environment, adopting methodologies based on neural networks [7,10,11]. Despite the undeniable importance of the dynamic environment in very crowded scenes, where interactions such as collision avoidance are likely to occur, the relevance of the static environment should not be disregarded. For one, motion models based solely on dynamic cues have been shown to underperform when the static environment strongly influences the trajectories [11,12]. In this case, the motion models are able to capture the influence of the surrounding agents.
However, they have no information about the semantics of the scene (e.g., walkable and forbidden regions) or about the presence of static obstacles. When applied to the task of trajectory prediction, such models can generate unrealistic trajectories that do not comply with the physical constraints of the scene. Methods based on agent interactions are also unsuitable for scenes where the density of agents is low, since the motion will be mostly guided by the static environment.

https://doi.org/10.1016/j.patcog.2020.107631 0031-3203/© 2020 Elsevier Ltd. All rights reserved.

Recently, a few works demonstrated that the static environment may play a very relevant role (e.g., [13–15]). However, these methods are unable to learn the physical structure of a scene without using additional information, such as semantic maps or image features extracted from video frames. In this work we argue that such data is not required, since the movement of the agents in a scene already conveys information about the static environment (e.g., pedestrians will tend to move on sidewalks, cars will not enter buildings, and obstacles will be avoided). We assume that the physical properties of a scene can be learned in an unsupervised way, directly from trajectory data. To achieve this goal, we pro-