Multimedia Systems
https://doi.org/10.1007/s00530-020-00652-x
SPECIAL ISSUE PAPER

Multi-feature-based crowd video modeling for visual event detection

Habib Ullah¹ · Ihtesham Ul Islam² · Mohib Ullah³ · Muhammad Afaq⁴ · Sultan Daud Khan¹ · Javed Iqbal²

© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
We propose a novel method for modeling crowd video dynamics by adopting a two-stream convolutional architecture that incorporates spatial and temporal networks. Our proposed method copes with the key challenge of capturing the complementary information on appearance from still frames and motion between frames. In our proposed method, a motion flow field is obtained from the video through dense optical flow. We demonstrate that the proposed method, trained on multi-frame dense optical flow, achieves a significant improvement in performance in spite of limited training data. We train and evaluate our proposed method on a benchmark crowd video dataset. The experimental results show that our method outperforms five reference methods, which we chose because they are the most relevant to our work.
Keywords Crowd analysis · Video modeling · Deep learning · CNN
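The two-stream input construction described in the abstract can be sketched minimally as follows. This is an illustrative toy, not the authors' implementation: it uses per-pixel frame differences as a crude stand-in for the dense optical flow the paper actually computes, and all function names and the stack length L are assumptions.

```python
import numpy as np

def motion_stack(frames, L=10):
    """Build a temporal-stream input by stacking L consecutive
    per-pixel frame differences (a simple stand-in for the dense
    optical-flow maps used in the paper).
    frames: (T, H, W) grayscale video with T >= L + 1."""
    diffs = [frames[t + 1] - frames[t] for t in range(L)]
    return np.stack(diffs, axis=0)  # shape (L, H, W)

def two_stream_inputs(frames, L=10):
    """Return the inputs for the two streams: a still frame for the
    spatial (appearance) network and stacked motion maps for the
    temporal network."""
    spatial = frames[0]                  # (H, W) still frame
    temporal = motion_stack(frames, L)   # (L, H, W) motion volume
    return spatial, temporal

# Usage on a synthetic clip of 11 frames at 224x224:
rng = np.random.default_rng(0)
video = rng.random((11, 224, 224)).astype(np.float32)
spatial, temporal = two_stream_inputs(video)
print(spatial.shape, temporal.shape)  # (224, 224) (10, 224, 224)
```

Each stream would then feed its own convolutional network, with the two predictions fused; a practical system would replace the frame differences with a dense optical-flow estimator.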
1 Introduction

The human population of the world has grown significantly in the past decade, as investigated by Devila [1]. This growth makes the safety of people, especially in crowded areas, a challenging issue. Therefore, it is essential to analyze crowded scenes to facilitate smart video surveillance for ensuring people's safety. One of the key tasks is to classify crowd videos to find potential risks [2, 3] associated with crowded areas. Besides that, crowd video modeling for different events has a vast domain of applications in robotics [4, 5], human–computer interaction [6, 7], sports analysis [8–10], video games [11], and management of web videos [12].
Most recent studies for crowd analysis address low- to medium-density crowd scenes. They are in general based on hand-crafted features, which are only useful for capturing certain characteristics of crowd content and typically limit deployment in generic settings, since the approaches are fine-tuned for specific conditions. Moreover, rather than identifying collective crowd behaviors, these studies cope with the problems of localized abnormal behavior detection [13], tracking individuals in crowds [14–16], counting people in crowds [17], and identifying different regions of motion using segmentation [18]. Limited efforts have been made to address the problem of modeling dense crowd videos, since it is a difficult and complex problem due to challenging spatio-temporal characteristics. One key challenge is inconsistency in a crowd scene. For example, a large crowd scene may be scattered and sparse, as depicted in Fig. 1. The distribution of people represents segments of the crowd flows located in different places of the scene. When the density of the people changes over time, the distribution changes accordingly. Therefore, the distribution and density of a crowd affect its coherency, which represents the interconnection among different segments. A consistent distribution
* Ihtesham Ul Islam
  ihtesham.csit@suit.edu.pk

  Habib Ullah
  h.ullah@uoh.edu.sa

  Mohib Ullah
  mohib.ullah@ntnu.no

  Muhammad Afaq
  afaq@jejunu.ac.kr

  Sultan Daud Khan
  su.khan@uoh.edu.sa

  Javed Iqbal
  javed.ee@suit.edu.pk

¹ College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia
² Sarhad University of Science and IT, Peshawar 25000, Pakistan
³ Norwegian University of Science and Technology, Gjovik, Norway
⁴ Jeju National University, Jeju-Si, South Korea