©2021 XXXX April 21, 2021, Mann, GJ
Real-Time Violence Detection Using CNN-LSTM
Mann B. Patel
Charotar University of Science and Technology
18dcs074@charusat.edu.in
Abstract - Although violence rates have fallen by about 57%
over the past four decades, acts of violence still occur
unseen by the law. Higher authorities can sometimes
control violence at a mass scale, but keeping everything in
line requires "micro-governing" every movement on every
road of every square. To address the butterfly effect in
this setting, I built a unique model and propose a
theoretical system that tackles the problem using deep
learning. The model takes CCTV video feeds as input and,
after drawing inference, recognizes whether violent
activity is taking place. The hypothesized architecture
aims at probability-driven computation over the video
feeds, which reduces the overhead of naively running
inference on every CCTV feed.
KEYWORDS
activity recognition, deep learning, inference algorithm,
pipeline, supervised, surveillance, violence
INTRODUCTION
I propose a pseudo-real-time violence detection system that
takes a video, with or without audio, and raises an alert
when violent activity is detected. This project focuses
solely on drawing inferences from whatever data I am able to
extract from the video feeds that arrive from the CCTV
network at one workstation (or, in the case of parallelism,
a cluster). I tackle the violence detection challenge using
two novel approaches. After basic testing, I chose the
CNN + LSTM approach; in later experiments I also test
different CNN models to find which one gives the highest
accuracy. I also try to extract information from the audio
track of the video and draw inference from it. In addition,
I hypothesize a signaling mechanism and a convenient
algorithm for allotting more computation to a video feed
for early detection.
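The idea of giving more computation to feeds that look suspicious can be illustrated with a minimal sketch. The allotment algorithm itself is not specified in this section, so everything below is an assumption made for illustration: the function name `allot_frame_budget`, the notion of a fixed per-tick frame budget, the per-feed minimum `floor`, and the proportional split are all hypothetical.

```python
def allot_frame_budget(violence_probs, total_budget, floor=1):
    """Split a frame budget across feeds in proportion to their scores.

    violence_probs: latest violence score per feed (hypothetical input).
    total_budget:   frames that can be processed in one tick (assumption).
    floor:          minimum frames per feed, so no feed is ever ignored.
    """
    n = len(violence_probs)
    if n == 0:
        return []
    if total_budget < n * floor:
        raise ValueError("budget too small to give every feed its floor")

    spare = total_budget - n * floor
    total_prob = sum(violence_probs)
    if total_prob == 0:
        # No feed looks suspicious: share the spare budget evenly.
        shares = [spare / n] * n
    else:
        shares = [spare * p / total_prob for p in violence_probs]

    allotment = [floor + int(s) for s in shares]

    # Frames lost to rounding down go to the largest fractional remainders.
    leftover = total_budget - sum(allotment)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]),
                   reverse=True)
    for i in order[:leftover]:
        allotment[i] += 1
    return allotment
```

With such a scheme, a feed whose recent score is high receives most of the budget while quiet feeds are still polled at the minimum rate, which is one way to avoid naively computing every feed at full rate.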
To find out which approach is better, I tried both. For
each one, I run a simple test after training on a
preprocessed dataset: the model infers over 20 frames
randomly extracted from any video in the dataset, and I
compare the outcomes.
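The frame-selection step of this test can be sketched as follows. This is a minimal illustration and not the code used in the project: the function name `sample_frame_indices` is hypothetical, and sampling without replacement and returning the indices in temporal order are assumptions (the latter chosen because a sequence model such as an LSTM consumes frames in order).

```python
import random


def sample_frame_indices(total_frames, k=20, seed=None):
    """Pick k distinct frame indices at random from a video.

    total_frames: number of frames in the video.
    k:            frames to extract (20 in the test described above).
    seed:         optional seed so a test run can be repeated.
    """
    if total_frames < k:
        raise ValueError("video has fewer frames than requested")
    rng = random.Random(seed)
    # Sample without replacement, then sort to keep temporal order.
    return sorted(rng.sample(range(total_frames), k))
```

The returned indices would then be used to read the corresponding frames from the video before they are passed to each trained model.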
DATASET
To test my methodology, I work with three datasets: the
Hockey Fight Dataset [4], the Movies Dataset [5] and
Violent-Flows [6]. The three datasets were captured with
closed-circuit TV, phones or high-resolution recorders, so
the quality, number of pixels and clip length vary between
datasets.
• Hockey fights: This dataset is composed of an equal number
of violent and non-violent actions from professional hockey
matches, usually with two players in close body
interaction.
• Movies: This dataset consists of fight sequences collected
from movies; the non-violence label covers videos of general
action gathered from movies. The dataset is made up of an
equal number of violent and non-violent movie clips. Unlike
the Hockey dataset, it varies profoundly between samples.
• Violent-Flows: This is a crowd violence dataset. Most of
the crowd violence in this dataset comes from clips of
football matches.
Dataset        Description       Total videos  Labelled violent  Labelled non-violent  Final pickled size
Hockey fights  hockey players    1000          500               500                   ≈200 MB
Violent-Flows  big crowd videos  200           100               100                   ≈100 MB
Movies         movie clips       246           123               123                   ≈150 MB
TABLE 1
DATASET SUMMARY
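As a quick sanity check on Table 1, the counts can be encoded and verified programmatically. The sketch below copies only the figures from the table; the names `TABLE_1` and `violent_fraction` are hypothetical and are not part of the project code.

```python
# Video counts copied from Table 1.
TABLE_1 = [
    {"dataset": "Hockey fights", "total": 1000, "violent": 500, "non_violent": 500},
    {"dataset": "Violent-Flows", "total": 200, "violent": 100, "non_violent": 100},
    {"dataset": "Movies", "total": 246, "violent": 123, "non_violent": 123},
]


def violent_fraction(rows):
    """Return the share of violent clips per dataset, checking the totals."""
    fractions = {}
    for row in rows:
        # The two label counts should add up to the stated total.
        if row["violent"] + row["non_violent"] != row["total"]:
            raise ValueError("label counts do not match total for "
                             + row["dataset"])
        fractions[row["dataset"]] = row["violent"] / row["total"]
    return fractions


if __name__ == "__main__":
    for name, frac in violent_fraction(TABLE_1).items():
        print(name, frac)
```

Every dataset comes out at a violent fraction of 0.5, which matches the equal split described for each dataset above.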