Discover Artificial Intelligence (2021) 1:6 | https://doi.org/10.1007/s44163-021-00004-2

Perspective

Exploring Convolutional Recurrent architectures for anomaly detection in videos: a comparative study

Ambareesh Ravi 1 · Fakhri Karray 1,2

Received: 24 June 2021 / Accepted: 4 August 2021
© The Author(s) 2021

Abstract
Convolutional Recurrent architectures are currently preferred over 3D convolutional networks for spatio-temporal learning tasks in videos, since the latter carry a huge computational burden, and it is imperative to understand how the different architectural configurations work. However, most current works on visual learning, especially for video anomaly detection, predominantly employ ConvLSTM networks and focus less on other possible variants of Convolutional Recurrent configurations for temporal learning. This warrants a study of the different possible variants so that informed, optimal design choices can be made according to the nature of the application at hand. We explore a variety of Convolutional Recurrent architectures and the influence of hyper-parameters on their performance for the task of anomaly detection. Through this work, we also intend to quantify the efficiency of the architectures based on the trade-off between their performance and computational complexity. With comprehensive quantitative and visual evidence, we establish that the ConvGRU-based configurations are the most effective and perform better than the popular ConvLSTM configurations on video anomaly detection tasks, in contrast to what is seen in the literature.

Keywords Video anomaly detection · ConvLSTM · ConvRNN · ConvGRU · Seq2Seq architectures

1 Introduction
Understanding videos has been one of the most challenging open problems in computer vision [1–3], with applications such as action recognition, scene description, video captioning, video summarization and video anomaly detection.
Video Anomaly Detection (VAD) is the process of identifying abnormal, rare and novel events with respect to the time and region of the video frames, with several real-world applications in areas like security, surveillance [4–8], manufacturing [9] and medicine [10]. Deep learning and Convolutional Neural Networks are predominantly used for visual tasks owing to their superior performance, which can be attributed to their ability to uncover and learn hidden patterns and to generalize well on huge datasets. However, most of the prevalent deep learning architectures require heavy computational and memory resources, which prohibits their use on edge devices for small applications and for on-premise computation motivated by data privacy. In systems involving real-time detection and alerts, such as video surveillance, the model needs to be highly efficient in inference and accurate in its decisions.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s44163-021-00004-2.

* Ambareesh Ravi, ambareesh.ravi@uwaterloo.ca; Fakhri Karray, karray@uwaterloo.ca | 1 Department of Electrical and Computer Engineering, Center for Pattern Analysis and Machine Intelligence (CPAMI), University of Waterloo, Ontario N2L 3G1, Canada. 2 Muhammad Ben Zayed University of AI, Masdar City, Abu Dhabi, UAE.