©2021 XXXX April 21, 2021, Mann, GJ 1

Real-Time Violence Detection Using CNN-LSTM

Mann B. Patel
Charotar University of Science and Technology
18dcs074@charusat.edu.in

Abstract - Although violence rates have fallen by about 57% over the past four decades, acts of violence still occur, unseen by the law. Higher authorities can sometimes control violence at a mass scale, but to keep everything in line one must "micro-govern" every movement on every road of every square. To address this butterfly-effect impact in this setting, I built a unique model and a hypothesized system that handle the problem using deep learning. The model takes CCTV video feeds as input and, after drawing inference, recognizes whether violent activity is taking place. The hypothesized architecture aims at probability-driven computation over video feeds, reducing the overhead of naively running inference on every CCTV feed.

KEYWORDS
activity recognition, deep learning, inference algorithm, pipeline, supervised, surveillance, violence

INTRODUCTION
I propose a pseudo-real-time violence detection system that takes a video, with or without audio, and raises an alert when violent activity is detected. This project focuses solely on drawing inferences from whatever data I am able to extract from the video feeds coming from the CCTV network to one workstation (or, in the case of parallelism, a cluster). I tackle the violence detection challenge using two novel approaches. After basic testing, I chose the CNN + LSTM approach; in further experiments I also test different CNN models to find which one provides the highest accuracy. I also try to extract information from the audio track of the video and draw inferences from it. Finally, I hypothesize a signaling mechanism and an algorithm that allots further computation to a video feed for early detection.
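The idea of allotting more computation to feeds that look suspicious can be illustrated with a small sketch. This is only one possible reading of probability-driven computation, not the algorithm used in this work: the function name `allocate_frames`, the per-tick `budget`, the per-feed `floor`, and the proportional split are all assumptions made for illustration. Each feed keeps a minimum number of frames per tick, and the spare budget is shared in proportion to the last violence probability inferred for each feed.

```python
import math


def allocate_frames(last_probs, budget, floor=1):
    """Split a per-tick frame budget across CCTV feeds (hypothetical scheme).

    last_probs: dict mapping feed id -> last inferred violence probability in [0, 1]
    budget:     total number of frames that can be processed this tick
    floor:      minimum frames every feed receives, so no feed is ever ignored
    """
    n = len(last_probs)
    if budget < floor * n:
        raise ValueError("budget too small to give every feed its floor")

    spare = budget - floor * n
    total = sum(last_probs.values())
    if total == 0:
        # No feed looks suspicious: share the spare budget evenly.
        weights = {feed: 1 / n for feed in last_probs}
    else:
        weights = {feed: p / total for feed, p in last_probs.items()}

    # Fractional share of the spare budget for each feed.
    shares = {feed: weights[feed] * spare for feed in last_probs}
    alloc = {feed: floor + math.floor(shares[feed]) for feed in last_probs}

    # Hand out frames lost to rounding, largest fractional remainder first.
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(
        last_probs,
        key=lambda feed: shares[feed] - math.floor(shares[feed]),
        reverse=True,
    )
    for feed in by_remainder[:leftover]:
        alloc[feed] += 1
    return alloc


if __name__ == "__main__":
    probs = {"cam_a": 0.9, "cam_b": 0.05, "cam_c": 0.05}
    print(allocate_frames(probs, budget=23))
```

The floor keeps every feed under minimal observation, so a feed with a low previous probability can still be picked up if violence starts there later.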
To find out which approach is better, I train both on a preprocessed dataset and then run a simple test: each model infers over 20 frames randomly extracted from a video in the dataset, and I compare the results.

DATASET
To test my methodology, I work with three datasets: Hockey Fight Dataset [4], Movies Dataset [5], and Violent-Flows [6]. The three datasets were captured with closed-circuit TV, phones, or high-resolution recorders; quality, pixel count, and clip length vary between datasets.

• Hockey fights: This dataset is composed of an equal number of violent and non-violent actions during professional hockey matches, usually with two players in close body interaction.
• Movies: This dataset consists of fight sequences collected from movies; for the non-violence label, it uses videos of general action activity gathered from movies. It is made up of an equal number of violent and non-violent movie clips. Unlike the Hockey dataset, this dataset varies profoundly between samples.
• Violent-Flows: This is a crowd violence dataset. Most of the crowd violence seen in this dataset comes from clips of football matches.

Dataset        Description       Total videos  Labelled Violent  Labelled Non-Violent  Final pickled size
Hockey fights  hockey players    1000          500               500                   ≈200 MB
Violent-Flows  big crowd videos  200           100               100                   ≈100 MB
Movies         movie clips       246           123               123                   ≈150 MB

TABLE 1 DATASET SUMMARY
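The comparison test described before the dataset overview draws 20 random frames from a video. A minimal sketch of that sampling step is given here, assuming the frame count of the video is already known. The function name `sample_frame_indices`, the `seed` parameter, and the choice to return the indices in temporal order are assumptions for illustration, not details taken from this work.

```python
import random


def sample_frame_indices(num_frames, k=20, seed=None):
    """Pick k distinct frame indices uniformly at random from one video.

    num_frames: number of frames in the video
    k:          how many frames to draw (20 in the test described in the text)
    seed:       optional seed so that both models see the same frames
    """
    if num_frames < k:
        raise ValueError("video has fewer frames than requested")
    rng = random.Random(seed)
    # Sampling without replacement; sorting keeps the frames in temporal order
    # (an assumption, useful if the frames are fed to a sequence model).
    return sorted(rng.sample(range(num_frames), k))


if __name__ == "__main__":
    print(sample_frame_indices(num_frames=150, k=20, seed=0))
```

Fixing the seed gives both candidate models the same 20 frames, which makes the side-by-side comparison fairer.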