This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS

sEnDec: An Improved Image to Image CNN for Foreground Localization

Thangarajah Akilan, Member, IEEE, and Q. M. Jonathan Wu, Senior Member, IEEE

Abstract—Although it is not immediately intuitive that Deep Convolutional Neural Networks (DCNNs) can yield adequate feature representations for a Foreground Localization (FGL) task, recent architectural and algorithmic advancements in Deep Learning (DL) have made DCNNs the forefront methodology for this pixel-level classification problem. In FGL, a DCNN must discriminate moving objects, i.e., the foreground (FG), from non-static background (BG) scenes by learning both local- and global-level features. Driven by the latest success of innovative structures for image classification and semantic segmentation, this work introduces a novel architecture, called Slow Encoder-Decoder (sEnDec), that aims to improve the learning capacity of a traditional image-to-image DCNN. The proposed model subsumes two subnets for contraction (encoding) and expansion (decoding); both phases employ intermediate feature-map up-sampling and residual connections. In this way, the structural details lost to spatial subsampling are recovered, yielding a more sharply delineated FG region. The experimental study is carried out with two variants of the proposed model: one with strided convolution (conv) and the other with max pooling for spatial subsampling. A comparative analysis on sixteen benchmark video sequences, covering baseline, dynamic background, camera jitter, shadow effects, intermittent object motion, night videos, and bad weather, shows that the proposed sEnDec model performs very competitively against prior and state-of-the-art approaches.
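The abstract's central mechanism, recovering structural detail lost to spatial subsampling by up-sampling intermediate feature maps and fusing them through residual connections, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are ours, plain stride-2 slicing stands in for the spatial effect of a learned strided convolution, and nearest-neighbour up-sampling stands in for the network's learned up-sampling.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 max pooling: keeps the strongest activation per window."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def strided_subsample(x):
    """Stride-2 subsampling, a stand-in for the spatial effect of a
    stride-2 convolution (the learned filter is omitted here)."""
    return x[::2, ::2]

def upsample_nn(x):
    """Nearest-neighbour up-sampling back to twice the resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Toy 4x4 feature map.
f = np.arange(16, dtype=float).reshape(4, 4)

# Encoder: subsample (detail is lost). Decoder: up-sample, then fuse the
# full-resolution encoder map through a residual (element-wise addition)
# connection -- the kind of skip fusion sEnDec uses to restore detail.
decoded = upsample_nn(max_pool2x2(f))
fused = decoded + f  # residual/skip connection at full resolution
```

Either subsampling variant (max pooling or strided conv) halves the spatial resolution; the residual addition is what reinjects the fine structure that the subsampled path alone cannot carry.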
Index Terms—Foreground localization, DCNN, encoder-decoder network.

I. INTRODUCTION

Foreground localization is a fundamental task in various computer-vision (CV) problems, such as salient object detection and recognition [1], content-aware image/video processing [2], object segmentation [3], foreground object extraction, signature extension in satellite remote sensing [4], visual tracking [5], [6], object discovery [7], human-robot interaction [8], and autonomous driving [9]. The main objective of FGL is to place a tight binary mask on the most probable region of pixels belonging to the moving objects in a scene. Such a mask is, in many ways, more informative than a simple bounding-box detection, as it closely localizes the FG objects. FGL can be formulated mathematically as

F_z = ( I_z − (1 − α_z) B_z ) / α_z,    (1)

where, for an observed pixel z, F_z, I_z, B_z, and α_z denote the FG color, the color received by the capturing device, the BG color, and the alpha (blending) parameter, respectively. FGL has been automated with myriad algorithms, including graph cuts, which require a user-supplied scribble or box on the FG and BG [10]; probabilistic models, such as Gaussian Mixture Models (GMMs) [11]; and top-down approaches that first detect objects and then classify the pixels inside each detected object boundary based on shape priors [12]. Recently, DCNN-based approaches, like the basic image-to-image architecture depicted in Fig. 1, have gained wider adoption for localizing FG regions in video sequences [13]–[16].

Manuscript received February 22, 2018; revised October 28, 2018 and August 19, 2019; accepted August 28, 2019. The Associate Editor for this article was H. Huang. (Corresponding author: Q. M. Jonathan Wu.)
T. Akilan is with the Department of Computer Science, Lakehead University, Thunder Bay, ON P7B 5E1, Canada (e-mail: takilan@lakeheadu.ca).
Q. M. J. Wu is with the Centre for Computer Vision and Deep Learning, Department of Electrical and Computer Engineering, University of Windsor, Windsor, ON N9B 2P4, Canada (e-mail: jwu@uwindsor.ca).
Digital Object Identifier 10.1109/TITS.2019.2940547

Fig. 1. Block diagram of a basic image-to-image CNN.

The exploitation of neural networks (NNs) for vision-based problems has a long history, arguably dating back to one of the pioneering computer-vision systems, the Mark I Perceptron machine built by Rosenblatt in the late 1950s [17]. Around the same time, Hubel and Wiesel's [18] discovery of the neural connectivity pattern of the cat's visual cortex inspired Fukushima to devise a network, coined the Neocognitron [19], which is invariant to image translations. The Neocognitron, devised with a backpropagation mechanism, paved the way for the modern-day DCNN: a multi-layered NN that integrates several layers of convolution, rectification, sub-sampling, and normalization operations. In such a network, the low-level conv layers operate similarly to Gabor filters and color-blob detectors [20], extracting primitive information, like edges and textures, while the top-level layers capture the abstract meaning of the input visuals, like shapes and structures. Unlike traditional machine learning (ML) theories, DCNNs emphasize automatic feature extraction and learning, typically from a large amount of data. The practical theories of advanced CNN architectures were proposed by Hinton et al. [21].

1524-9050 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
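Equation (1) in the introduction inverts the standard alpha-compositing model, I_z = α_z F_z + (1 − α_z) B_z, to recover the FG color from the observed color. A minimal Python sketch makes the per-pixel arithmetic concrete; the helper name `recover_fg` and the `eps` guard against division by zero are ours, not from the paper.

```python
def recover_fg(I, B, alpha, eps=1e-6):
    """Invert the compositing model I = alpha*F + (1 - alpha)*B
    for one pixel channel, per Eq. (1): F = (I - (1 - alpha)*B) / alpha."""
    if alpha < eps:                      # pure-BG pixel: FG color undefined
        return None
    return (I - (1.0 - alpha) * B) / alpha

# Fully opaque FG pixel: the observed color IS the FG color.
print(recover_fg(0.8, 0.2, 1.0))        # 0.8
# Half-transparent pixel observed as 0.5 over a BG of 0.2.
print(recover_fg(0.5, 0.2, 0.5))        # ≈ 0.8
```

Note that as α_z approaches 0 the inversion becomes ill-conditioned, which is one reason practical FGL methods estimate a binary mask directly rather than solving Eq. (1) per pixel.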