This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS
sEnDec: An Improved Image to Image CNN for Foreground Localization
Thangarajah Akilan, Member, IEEE, and Q. M. Jonathan Wu, Senior Member, IEEE
Abstract—Although it is not immediately intuitive that Deep Convolutional
Neural Networks (DCNNs) can yield an adequate feature representation
for a Foreground Localization (FGL) task, recent architectural
and algorithmic advances in Deep Learning (DL) have shown
that DCNNs have become the forefront methodology for this pixel-level
classification problem. In FGL, a DCNN faces an inherent trade-off
between separating the moving objects, i.e., the foreground (FG), and the non-static
background (BG) scene, through learning from local- and global-level
features. Driven by the recent success of innovative structures for
image classification and semantic segmentation, this work introduces a
novel architecture, called Slow Encoder-Decoder (sEnDec), that aims to
improve the learning capacity of a traditional image-to-image DCNN. The
proposed model subsumes two subnets for contraction (encoding) and
expansion (decoding); in both phases, it employs intermediate
feature-map up-sampling and residual connections. In this way, the
structural details lost to spatial subsampling are recovered, yielding
a more sharply delineated FG region. The experimental study is carried out with
two variants of the proposed model: one with strided convolution (conv)
and the other with max pooling for spatial subsampling. A comparative
analysis on sixteen benchmark video sequences, covering baseline,
dynamic background, camera jitter, shadow effects, intermittent object
motion, night videos, and bad weather, shows that the proposed sEnDec
model performs very competitively against prior and state-of-the-art
approaches.
Index Terms— Foreground localization, DCNN, encoder-decoder
network.
I. INTRODUCTION
FOREGROUND localization is a fundamental task in various
computer-vision (CV) problems, like salient object detection
and recognition [1], content-aware image/video processing [2], object
segmentation [3], foreground object extraction, signature extension
in satellite remote sensing [4], visual tracking [5], [6], object
discovery [7], human-robot interaction [8], and autonomous driving [9]. The
main objective of FGL is to place a tight binary mask on the most
probable region of pixels belonging to the moving objects in the scene. Such
a mask is, in many ways, more informative than a simple bounding-box
detection, as it provides a close localization of the FG objects. The
FGL can be formulated mathematically as

$F_z = \dfrac{I_z - (1 - \alpha_z)\, B_z}{\alpha_z}$,  (1)
where, for an observed pixel z, $F_z$, $I_z$, $B_z$, and $\alpha_z$ are the received
color, the FG color, the BG color, and the alpha parameter of the capturing
device, respectively. FGL has been automated with myriad algorithms,
including graph-cut, which requires a user-supplied scribble or box on
the FG and BG [10]; probabilistic models like Gaussian Mixture
Models (GMM) [11]; and top-down approaches that first detect
objects and then classify the pixels inside the detected object boundary based
on shape priors [12]. Recently, DCNN-based approaches, like the
basic image-to-image architecture depicted in Fig. 1, have gained wider
adoption for localizing FG regions in video sequences [13]–[16].
Manuscript received February 22, 2018; revised October 28, 2018 and
August 19, 2019; accepted August 28, 2019. The Associate Editor for this
article was H. Huang. (Corresponding author: Q. M. Jonathan Wu.)
T. Akilan is with the Department of Computer Science, Lakehead University,
Thunder Bay, ON P7B 5E1, Canada (e-mail: takilan@lakeheadu.ca).
Q. M. J. Wu is with the Centre for Computer Vision and Deep Learning,
Department of Electrical and Computer Engineering, University of Windsor,
Windsor, ON N9B 2P4, Canada (e-mail: jwu@uwindsor.ca).
Digital Object Identifier 10.1109/TITS.2019.2940547
Fig. 1. Block diagram of a basic image-to-image CNN.
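Eq. (1) inverts the standard alpha-compositing model $I_z = \alpha_z F_z + (1 - \alpha_z) B_z$ to recover the FG color. A minimal per-pixel NumPy sketch of this inversion (the function name and the `eps` guard are illustrative, not part of the paper):

```python
import numpy as np

def recover_foreground(I, B, alpha, eps=1e-6):
    """Recover the FG color F from the observed color I, the BG color B,
    and the alpha map, by inverting I = alpha*F + (1 - alpha)*B (Eq. 1)."""
    alpha = np.clip(alpha, eps, 1.0)      # avoid division by zero
    return (I - (1.0 - alpha) * B) / alpha

# Fully opaque pixel (alpha = 1): the observation is the FG color itself.
I = np.array([0.8]); B = np.array([0.2]); a = np.array([1.0])
print(recover_foreground(I, B, a))        # -> [0.8]
```

For a half-transparent pixel, the recovered FG color blends correspondingly; e.g. with $I_z = B_z$ and $\alpha_z = 0.5$, Eq. (1) returns $F_z = B_z$.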
The exploitation of Neural Networks (NNs) for vision-based problems
possesses a long history, arguably beginning with one of the pioneering
computer-vision systems, the Mark I Perceptron machine built by Rosenblatt
in the late 1950s [17]. Presumably concurrent with that, Hubel
and Wiesel's [18] discovery of the neural connectivity pattern of the cat's
visual cortex inspired Fukushima to devise a network, coined the
Neocognitron [19], which is invariant to image translations. The
Neocognitron, devised with a backpropagation mechanism, paved the
way for the modern-day DCNN, a multi-layered NN that integrates
several layers of convolution, rectification, sub-sampling, and
normalization operations. In such networks, the low-level conv layers operate
similarly to Gabor filters and color-blob detectors [20], extracting
primitive information, like edges and textures, while the top-level
layers provide the abstract meaning of the input visuals, like shapes
and structures. Unlike traditional machine learning (ML) approaches,
DCNNs emphasize automatic feature extraction and learning, directly
from a large amount of data. The practical theories of
advanced CNN architectures were proposed by Hinton et al. [21].
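The layered operations named above (convolution, rectification, sub-sampling) can be illustrated with a minimal single-channel forward pass; the toy input and the crude gradient filter below are placeholders, not the paper's architecture:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D cross-correlation of a single-channel image with kernel k."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):                      # rectification
    return np.maximum(x, 0.0)

def max_pool(x, s=2):             # spatial sub-sampling by factor s
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h * s, :w * s].reshape(h, s, w, s).max(axis=(1, 3))

# Toy 6x6 input through one conv-relu-pool stage.
img = np.arange(36, dtype=float).reshape(6, 6)
edge = np.array([[1.0, -1.0]])    # crude horizontal-gradient filter
feat = max_pool(relu(conv2d(img, edge)))
print(feat.shape)                 # -> (3, 2)
```

Each such stage halves the spatial resolution; it is precisely this loss of structural detail that the proposed sEnDec model aims to recover through intermediate up-sampling and residual connections.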
1524-9050 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.