A Cognitive Memory-Augmented Network for Visual Anomaly Detection

Tian Wang, Xing Xu, Member, IEEE, Fumin Shen, Member, IEEE, and Yang Yang, Senior Member, IEEE

Abstract—With the rapid development of automated visual analysis, visual analysis systems have become a popular research topic in computer vision. Such systems can assist humans in detecting anomalous events (e.g., fighting or walking alone on the grass). In general, existing methods for visual anomaly detection are based on an autoencoder architecture, i.e., reconstructing the current frame or predicting the future frame; the reconstruction error is then adopted as the evaluation metric to identify whether an input is abnormal. A flaw of these methods is that abnormal samples can also be reconstructed well. In this paper, inspired by the human memory ability, we propose a novel deep neural network (DNN) based model termed cognitive memory-augmented network (CMAN) for visual anomaly detection. The proposed CMAN model assumes that the visual analysis system imitates humans: it remembers normal samples and then distinguishes abnormal events in the collected videos. Specifically, CMAN introduces a memory module that simulates the memory capacity of humans and a density estimation network that learns the data distribution. The reconstruction errors and the novelty scores are used to distinguish abnormal events in videos. In addition, we develop a two-step training scheme so that the memory module and the density estimation network cooperate to improve performance. Comprehensive experiments on various popular benchmarks show the superiority and effectiveness of the proposed CMAN model for visual anomaly detection compared with state-of-the-art methods.
The implementation code of our CMAN method can be accessed at https://github.com/CMAN-code/CMAN_pytorch.

Index Terms—Cognitive computing, density estimation, memory, visual analysis systems, visual anomaly detection.

I. Introduction

Anomaly detection in a video sequence is the process of identifying abnormal events that are unexpected in the video, e.g., fighting, walking alone on the grass, and vehicles on the sidewalk. Visual anomaly detection applies anomaly detection to visual data, with the main purpose of finding abnormal samples within it. It has attracted increasing attention in both the research and industrial communities, since it can solve many practical problems, e.g., security problems, automatic early warning of natural disasters, and analysis of traffic monitoring videos. For example, intelligent robots equipped with visual anomaly detection models can analyze surveillance videos of public places, determine whether a fight has occurred, and issue a warning. Generally, visual anomaly detection is an extremely challenging problem for two reasons. Firstly, abnormal events occur very rarely in the real world, so it is difficult to collect them. Secondly, the same event can produce a completely opposite result in different environments (e.g., walking alone on the sidewalk versus on the grass). Therefore, visual anomaly detection is naturally treated as an unsupervised learning problem in the literature, with the purpose of learning a model trained only on normal data. Generally, existing methods [1]–[4] for visual anomaly detection adopt the reconstruction approach and its diverse variants. In the training phase, a normal video clip is fed into the model, which extracts a feature representation and then reconstructs the input video clip from it.
The reconstruction error between an input and its reconstruction can then be used as a criterion for detecting whether the input is abnormal. Recently, deep neural networks (DNNs) have been widely used in computer vision, and reconstruction-based methods have also benefited from them. Existing methods usually select an autoencoder (AE) [5], [6] based on convolutional neural networks (CNNs) as the basic architecture, such as [3], [7]. An AE consists of an encoder that obtains a feature representation and a decoder that maps the feature representation back to the original image space. Thanks to the powerful feature representation capability of CNNs, the AE can be used to obtain the input's feature representation in latent space. Generally, abnormal frames are not supposed to be reconstructed well by an AE, as the AE is trained on normal data, which means that abnormal frames should have a larger reconstruction error than normal frames. However, in practice an abnormal input can still be reconstructed well.

Manuscript received January 3, 2021; revised February 24, 2021; accepted March 28, 2021. This work was supported in part by the National Natural Science Foundation of China (61976049, 62072080, U20B2063), the Fundamental Research Funds for the Central Universities (ZYGX2019Z015), the Sichuan Science and Technology Program, China (2018GZDZX0032, 2019ZDZX0008, 2019YFG0003, 2019YFG0533, 2020YFS0057), and the Dongguan Songshan Lake Introduction Program of Leading Innovative and Entrepreneurial Talents. Recommended by Associate Editor Huimin Lu. (Corresponding author: Xing Xu.)

Citation: T. Wang, X. Xu, F. Shen, and Y. Yang, "A cognitive memory-augmented network for visual anomaly detection," IEEE/CAA J. Autom. Sinica, vol. 8, no. 7, pp. 1296–1307, Jul. 2021.

T. Wang, X. Xu, and F. Shen are with the Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu 611731, China (e-mail: wangtianguge@gmail.com; xing.xu@uestc.edu.cn; fumin.shen@gmail.com). Y. Yang is with the Center for Future Multimedia and School of Computer Science and Engineering, UESTC, Chengdu 611731, and also with the Institute of Electronic and Information Engineering of UESTC in Guangdong, Dongguan 523808, China (e-mail: dlyyang@gmail.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/JAS.2021.1004045
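To make the reconstruction-error criterion discussed above concrete, the following minimal sketch illustrates it with a linear PCA "autoencoder" in NumPy. This is not the CMAN model, and all data and names here are hypothetical: a model is fit on normal data only, and a frame is flagged when its reconstruction error exceeds a threshold calibrated on the normal training errors.

```python
import numpy as np

# Minimal sketch of reconstruction-based anomaly scoring (hypothetical
# data; a linear PCA "autoencoder" stands in for a CNN-based AE).
rng = np.random.default_rng(0)

# Normal frames lie near a 2-D subspace of a 10-D feature space.
basis = rng.standard_normal((10, 2))
normal = (rng.standard_normal((200, 2)) @ basis.T
          + 0.01 * rng.standard_normal((200, 10)))    # training data (normal only)
test_normal = rng.standard_normal((5, 2)) @ basis.T   # in-distribution frames
test_abnormal = 3.0 * rng.standard_normal((5, 10))    # off-subspace frames

# "Encoder/decoder": project onto the top-k principal components.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]                                   # k = 2 latent dimensions

def recon_error(x):
    z = (x - mean) @ components.T        # encode: input -> latent space
    x_hat = z @ components + mean        # decode: latent -> input space
    return np.linalg.norm(x - x_hat, axis=1)  # per-frame reconstruction error

# Threshold fit on the normal training errors; larger error => anomaly.
threshold = 1.1 * recon_error(normal).max()
print("normal errors:  ", recon_error(test_normal).round(4))
print("abnormal errors:", recon_error(test_abnormal).round(4))
print("flagged as abnormal:", recon_error(test_abnormal) > threshold)
```

The same decision rule carries over to a CNN autoencoder trained on normal clips. The paper's point, however, is that this criterion alone can fail, since an abnormal input is sometimes reconstructed well, which motivates augmenting it with a memory module and a density-based novelty score.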