Discover Artificial Intelligence (2021) 1:6 | https://doi.org/10.1007/s44163-021-00004-2
Perspective
Exploring Convolutional Recurrent architectures for anomaly
detection in videos: a comparative study
Ambareesh Ravi¹ · Fakhri Karray¹,²
Received: 24 June 2021 / Accepted: 4 August 2021
© The Author(s) 2021 OPEN
Abstract
Convolutional Recurrent architectures are currently preferred over 3D convolutional networks for spatio-temporal learning tasks in videos, since the latter carry a heavy computational burden. It is therefore imperative to understand how different architectural configurations work. Most current works on visual learning, especially for video anomaly detection, predominantly employ ConvLSTM networks and pay less attention to other possible Convolutional Recurrent configurations for temporal learning. This warrants a study of the different possible variants so that informed, optimal design choices can be made according to the nature of the application at hand. We explore a variety of Convolutional Recurrent architectures and the influence of hyper-parameters on their performance for the task of anomaly detection. Through this work, we also intend to quantify the efficiency of the architectures based on the trade-off between their performance and computational complexity. With comprehensive quantitative and visual evidence, we establish that ConvGRU-based configurations are the most effective and perform better than the popular ConvLSTM configurations on video anomaly detection tasks, in contrast to what is seen in the literature.
Keywords Video anomaly detection · ConvLSTM · ConvRNN · ConvGRU · Seq2Seq architectures
1 Introduction
Understanding videos has been one of the most challenging open problems in computer vision [1–3], with applications such as action recognition, scene description, video captioning, video summarization and video anomaly detection. Video Anomaly Detection (VAD) is the process of identifying abnormal, rare and novel events, localized in both time and region of the video frames, and it has several real-world applications in areas like security, surveillance [4–8], manufacturing [9] and medicine [10]. Deep learning and Convolutional Neural Networks are predominantly used for visual tasks owing to their superior performance, which can be attributed to their ability to uncover and learn hidden patterns and generalize well on huge datasets. However, most of the prevalent deep learning architectures require heavy computation and large memory, which prohibits their use on edge devices for small applications and for on-premise computation where data privacy matters. In systems involving real-time detection and alerts, like video surveillance, the model needs to be highly efficient in inference and accurate in its decisions.
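To give a rough sense of the computational side of this trade-off, the sketch below counts the learnable weights of a single convolutional recurrent cell for the three families named in the keywords. It is an illustration under stated assumptions, not the configuration or measurement used in this paper: each gate is assumed to apply one k x k convolution to the concatenated input and hidden state, peephole connections are ignored, and the channel and kernel sizes are arbitrary. The function names are hypothetical.

```python
# Assumption: one k x k convolution per gate over the concatenated
# [input, hidden] tensor, no peephole terms. Sizes are illustrative only.

GATES = {"ConvRNN": 1, "ConvGRU": 3, "ConvLSTM": 4}


def conv_recurrent_params(in_ch, hidden_ch, kernel, n_gates, bias=True):
    """Number of learnable weights in one convolutional recurrent cell."""
    per_gate = kernel * kernel * (in_ch + hidden_ch) * hidden_ch
    if bias:
        per_gate += hidden_ch  # one bias per output channel of the gate
    return n_gates * per_gate


def compare(in_ch=64, hidden_ch=64, kernel=3):
    """Parameter count per cell type for the same channel and kernel sizes."""
    return {
        name: conv_recurrent_params(in_ch, hidden_ch, kernel, gates)
        for name, gates in GATES.items()
    }


if __name__ == "__main__":
    for name, count in compare().items():
        print(f"{name:9s} {count:,} parameters")
```

Under these assumptions a ConvGRU cell carries three quarters of the weights of a ConvLSTM cell with the same channel and kernel sizes, because it uses three gate convolutions instead of four.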
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s44163-021-00004-2.
* Ambareesh Ravi, ambareesh.ravi@uwaterloo.ca; Fakhri Karray, karray@uwaterloo.ca | ¹Department of Electrical and Computer Engineering, Center for Pattern Analysis and Machine Intelligence (CPAMI), University of Waterloo, Ontario N2L 3G1, Canada. ²Muhammad Ben Zayed University of AI, Masdar City, Abu Dhabi, UAE.