Multimedia Systems https://doi.org/10.1007/s00530-020-00652-x

SPECIAL ISSUE PAPER

Multi-feature-based crowd video modeling for visual event detection

Habib Ullah 1 · Ihtesham Ul Islam 2 · Mohib Ullah 3 · Muhammad Afaq 4 · Sultan Daud Khan 1 · Javed Iqbal 2

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
We propose a novel method for modeling crowd video dynamics by adopting a two-stream convolutional architecture which incorporates spatial and temporal networks. Our proposed method copes with the key challenge of capturing the complementary information on appearance from still frames and motion between frames. In our proposed method, a motion flow field is obtained from the video through dense optical flow. We demonstrate that the proposed method, trained on multi-frame dense optical flow, achieves a significant improvement in performance in spite of limited training data. We train and evaluate our proposed method on a benchmark crowd video dataset. The experimental results show that our method outperforms five reference methods, which we have chosen because they are the most relevant to our work.

Keywords Crowd analysis · Video modeling · Deep learning · CNN

1 Introduction

The human population of the world has grown significantly in the past decade, as investigated by Devila [1]. This growth makes the safety of people, especially in crowded areas, a challenging issue. Therefore, it is essential to analyze crowded scenes to facilitate smart video surveillance for ensuring people's safety. One of the key tasks is to classify crowd videos to find potential risks [2, 3] associated with crowded areas. Besides that, crowd video modeling for different events has a vast domain of applications in robotics [4, 5], human–computer interaction [6, 7], sports analysis [8–10], video games [11], and management of web videos [12].
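The abstract mentions that the temporal network is trained on multi-frame dense optical flow. A common way to feed such flow into a convolutional network is to stack the horizontal and vertical displacement components of L consecutive flow fields into a single 2L-channel input. The sketch below is a minimal illustration of that stacking step, not the authors' implementation: the function name `stack_flow` and the synthetic random flow fields (standing in for flow computed by, e.g., a Farnebäck-style dense optical flow algorithm) are our own assumptions.

```python
import numpy as np

def stack_flow(flows):
    """Stack L dense optical-flow fields into one 2L-channel
    temporal-stream input, interleaving the horizontal (u) and
    vertical (v) displacement components of each field."""
    channels = []
    for flow in flows:                 # each flow: (H, W, 2) array of (u, v)
        channels.append(flow[..., 0])  # horizontal component
        channels.append(flow[..., 1])  # vertical component
    return np.stack(channels, axis=0)  # shape: (2L, H, W)

# Synthetic stand-in for flow fields computed from consecutive frame pairs
L_frames, H, W = 10, 224, 224
flows = [np.random.randn(H, W, 2).astype(np.float32) for _ in range(L_frames)]
x = stack_flow(flows)
print(x.shape)  # (20, 224, 224)
```

With L = 10 flow fields, the temporal stream receives a 20-channel input, which is why even a short clip provides a rich motion signal despite limited training data.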
Most of the recent studies on crowd analysis address low- to medium-density crowd scenes. They are in general based on hand-crafted features, which are only useful for capturing certain characteristics of crowd content; this typically limits their deployment in generic settings, as the approaches are fine-tuned for specific conditions. Moreover, rather than identifying collective crowd behaviors, these studies cope with the problems of localized abnormal behavior detection [13], tracking individuals in crowds [14–16], counting people in crowds [17], and identifying different regions of motion using segmentation [18]. Limited efforts have been made to address the problem of modeling dense crowd videos, since it is a difficult and complex problem due to challenging spatio-temporal characteristics. One key challenge is inconsistency in a crowd scene. For example, a large crowd scene may be scattered and sparse, as depicted in Fig. 1. The distribution of people represents segments of the crowd flows located in different places of the scene. When the density of the people changes over time, the distribution changes accordingly. Therefore, the distribution and density of a crowd affect its coherency, which represents the interconnection among its different segments. A consistent distribution

* Ihtesham Ul Islam: ihtesham.csit@suit.edu.pk
Habib Ullah: h.ullah@uoh.edu.sa
Mohib Ullah: mohib.ullah@ntnu.no
Muhammad Afaq: afaq@jejunu.ac.kr
Sultan Daud Khan: su.khan@uoh.edu.sa
Javed Iqbal: javed.ee@suit.edu.pk

1 College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia
2 Sarhad University of Science and IT, Peshawar 25000, Pakistan
3 Norwegian University of Science and Technology, Gjovik, Norway
4 Jeju National University, Jeju-si, South Korea
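Returning to the two-stream architecture introduced in the abstract: the spatial (appearance) and temporal (motion) streams each produce class scores, which must be combined into one event prediction. A standard approach in two-stream networks is late fusion by weighted averaging of the softmax scores of the two streams. The following is a minimal sketch of that idea under our own assumptions; the function names, the equal fusion weight, and the five-class example logits are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_stream(spatial_logits, temporal_logits, w_temporal=0.5):
    """Late fusion: weighted average of the softmax scores of the
    spatial (appearance) and temporal (motion) streams."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return (1.0 - w_temporal) * p_spatial + w_temporal * p_temporal

# Hypothetical logits for a 5-class crowd-event problem
spatial = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
temporal = np.array([1.5, 1.8, 0.2, -0.5, 0.1])
p = fuse_two_stream(spatial, temporal)
print(int(np.argmax(p)))  # 0: both streams favor class 0 after fusion
```

The fusion weight `w_temporal` controls how much the motion stream contributes relative to appearance; equal weighting is the simplest choice, and the best value would in practice be selected on validation data.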