MOVING OBJECT DETECTION IN NOISY VIDEO SEQUENCES USING DEEP CONVOLUTIONAL DISENTANGLED REPRESENTATIONS

Jorge García-González, Rafael M. Luque-Baena, Juan M. Ortiz-de-Lazcano-Lobato, Ezequiel López-Rubio

Department of Computer Languages and Computer Science, University of Málaga
Biomedical Research Institute of Málaga (IBIMA)

This work is partially supported by the Ministry of Science, Innovation and Universities of Spain under grant RTI2018-094645-B-I00, project name Automated detection with low-cost hardware of unusual activities in video sequences. It is also partially supported by the Autonomous Government of Andalusia (Spain) under project UMA18-FEDERJA-084, project name Detection of anomalous behavior agents by deep learning in low-cost video surveillance intelligent systems. It is also partially supported by the Autonomous Government of Andalusia (Spain) under project UMA20-FEDERJA-108, project name Detection, characterization and prognosis value of the non-obstructive coronary disease with deep learning. All of them include funds from the European Regional Development Fund (ERDF). It is also partially supported by the University of Malaga (Spain) under grants B1-2019 01, project name Anomaly detection on roads by moving cameras, and B1-2019 02, project name Self-Organizing Neural Systems for Non-Stationary Environments. The authors thankfully acknowledge the computer resources, technical expertise and assistance provided by the SCBI (Supercomputing and Bioinformatics) center of the University of Málaga. They also gratefully acknowledge the support of NVIDIA Corporation with the donation of two Titan X GPUs. The authors also thankfully acknowledge the grant of the Universidad de Málaga and the Instituto de Investigación Biomédica de Málaga - IBIMA.

ABSTRACT

Noise robustness is crucial in moving object detection, since image noise is easily mistaken for movement. To deal with noise, deep denoising autoencoders are commonly applied to image patches, which carries an inherent disadvantage with respect to the segmentation resolution. In this work, a fully convolutional autoencoder-based moving object detection model is proposed that deals with noise without requiring patch extraction. Different autoencoder structures and training strategies are also tested to gain insight into the best network design approach.

Index Terms: Moving Object Detection, Foreground Segmentation, Autoencoders

1. INTRODUCTION

Foreground segmentation can be described as identifying genuine motion within a sequence. Unlike object detection, foreground segmentation (also known as background subtraction) is based on analyzing changes in a video sequence over time, so it does not make sense for a single static image. Motion detection methods are therefore usually based on analyzing the change in image regions, either at pixel level [1] or at patch level [2]. They can also rely on deep learning-based object detection and tracking [3, 4]. However, such methods depend on detection models with a limited number of classes defined at training time, and they are currently unable to identify the motion of an object that is not included in these classes. Either way, noise robustness is a key feature.

When applying deep learning to foreground segmentation, a common strategy is to use autoencoders [5]. An autoencoder is an unsupervised neural network trained to return its input as output [6]. It comprises two modules: an encoder (the layers that produce a coded representation of the input data) and a decoder (the layers that turn the coded representation back into the input data).
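The encoder/decoder split described above can be illustrated with a deliberately tiny sketch. It is not the convolutional model proposed in this paper: it assumes a purely linear autoencoder, plain gradient descent, and synthetic 4-dimensional inputs that lie in a 2-dimensional subspace. All names (`encode`, `decode`, `train_step`, `run_demo`) and all hyperparameters are hypothetical and chosen only for illustration.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode(w_enc, x):
    """Encoder: map the input to a smaller coded representation."""
    return matvec(w_enc, x)


def decode(w_dec, code):
    """Decoder: map the coded representation back to input space."""
    return matvec(w_dec, code)


def reconstruction_loss(w_enc, w_dec, data):
    """Mean squared reconstruction error over the data set."""
    total = 0.0
    for x in data:
        y = decode(w_dec, encode(w_enc, x))
        total += sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return total / len(data)


def train_step(w_enc, w_dec, data, lr):
    """One full-batch gradient descent step on the reconstruction loss."""
    n = len(data)
    g_enc = [[0.0] * len(row) for row in w_enc]
    g_dec = [[0.0] * len(row) for row in w_dec]
    for x in data:
        code = encode(w_enc, x)
        y = decode(w_dec, code)
        err = [2.0 * (yi - xi) / n for yi, xi in zip(y, x)]  # dLoss/dy
        # Back-propagate the error through the decoder to the code.
        d_code = [sum(err[i] * w_dec[i][j] for i in range(len(w_dec)))
                  for j in range(len(code))]
        for i in range(len(w_dec)):
            for j in range(len(code)):
                g_dec[i][j] += err[i] * code[j]
        for j in range(len(w_enc)):
            for k in range(len(x)):
                g_enc[j][k] += d_code[j] * x[k]
    # Update both weight matrices in place.
    for grads, weights in ((g_enc, w_enc), (g_dec, w_dec)):
        for g_row, w_row in zip(grads, weights):
            for k in range(len(w_row)):
                w_row[k] -= lr * g_row[k]


def run_demo(steps=300, lr=0.05, seed=0):
    """Train the toy autoencoder; return (initial loss, final loss)."""
    rng = random.Random(seed)
    # Assumed toy data: 4-D inputs confined to a 2-D subspace, so a
    # 2-D code is in principle enough to reconstruct them.
    data = []
    for _ in range(50):
        a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
        data.append([a, b, a, b])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(4)]
    initial = reconstruction_loss(w_enc, w_dec, data)
    for _ in range(steps):
        train_step(w_enc, w_dec, data, lr)
    final = reconstruction_loss(w_enc, w_dec, data)
    return initial, final


if __name__ == "__main__":
    initial_loss, final_loss = run_demo()
    print(f"reconstruction loss: {initial_loss:.4f} -> {final_loss:.4f}")
```

The only training signal is the input itself, which is what makes the network unsupervised; the code produced by `encode` is half the size of the input in this toy setting.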
An autoencoder can also be trained to return a denoised version of its input as output, in which case it is known as a denoising autoencoder. Training autoencoders so that the encoder yields a denoised and dramatically smaller version of image patches is a well-known strategy in foreground segmentation [2, 7]. This approach requires dividing images into patches large enough to contain useful information (e.g. 16 × 16), so the segmentation resolution is coarse. A tiling strategy can be used to increase the resolution [8, 4], but the computational requirement rises with the segmentation resolution.

The present paper proposes a foreground segmentation model for noisy sequences based on deep convolutional autoencoders, which applies the autoencoder strategy with no patch extraction; the segmentation resolution is thus increased without resorting to tiling. A comparison between different structures and training strategies is also included to gain insight into the best way to define the autoencoder.

The remainder of the paper is organized as follows: Section 2 presents the methodology and the new proposal, Section 3 gives the implementation and experimentation details together with the results, and Section 4 presents the conclusions.
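As a rough illustration of the resolution and cost trade-off of patch-based segmentation discussed in this introduction, the following sketch counts how many patches must be evaluated for a single frame. It rests on assumptions that are not taken from the paper: a hypothetical 240 × 320 frame, one foreground/background decision per patch, and "tiling" interpreted as sliding the 16 × 16 patch over the frame with a smaller stride. The function names are illustrative only.

```python
def patch_grid(height, width, patch, stride):
    """Number of patch positions (rows, cols) on a sliding grid."""
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows, cols


def patch_evaluations(height, width, patch=16, stride=16):
    """Patches to evaluate per frame, assuming one decision per patch."""
    rows, cols = patch_grid(height, width, patch, stride)
    return rows * cols


if __name__ == "__main__":
    height, width = 240, 320  # hypothetical frame size
    for stride in (16, 4, 1):
        rows, cols = patch_grid(height, width, 16, stride)
        count = patch_evaluations(height, width, 16, stride)
        print(f"stride {stride:2d}: {rows} x {cols} decisions, "
              f"{count} patch evaluations")
    # A pixel-level model instead outputs one decision per pixel.
    print(f"pixel level: {height} x {width} = {height * width} decisions")
```

Under these assumptions, non-overlapping patches give a 15 × 20 segmentation map (300 evaluations), whereas a stride of 1 gives a 225 × 305 map at the price of 68625 patch evaluations per frame. This is the kind of cost growth that motivates avoiding patch extraction altogether.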