A LIGHTWEIGHT SELF-SUPERVISED TRAINING FRAMEWORK FOR MONOCULAR
DEPTH ESTIMATION
Tim Heydrich, Yimin Yang
Lakehead University
Department of Computer Science
Thunder Bay, ON, Canada
Shan Du*
The University of British Columbia Okanagan
Department of Computer Science,
Mathematics, Physics and Statistics
Kelowna, BC, Canada
ABSTRACT
Depth estimation attracts great interest in various sectors such as robotics, human-computer interfaces, intelligent visual surveillance, and wearable augmented reality gear. Monocular depth estimation is of particular interest due to its low complexity and cost. Research in recent years has shifted away from supervised learning towards unsupervised or self-supervised approaches. While these have achieved great results, most of the research has focused on large, heavy networks that are highly resource-intensive, making them unsuitable for systems with limited resources. We are particularly concerned about the increased training complexity that current self-supervised approaches bring. In
this paper, we propose a lightweight self-supervised training framework that utilizes computationally cheap methods to compute ground truth approximations. In particular, we use a stereo pair of images during training to compute a photometric reprojection loss and a disparity ground truth approximation. Thanks to this ground truth approximation, our framework removes the need for pose estimation and the corresponding heavy prediction networks that current self-supervised methods rely on. Our experiments demonstrate that our framework increases the generator's performance at a fraction of the size required by the current state-of-the-art self-supervised approach.
Index Terms— computer vision, depth estimation, deep
learning, self-supervised
1. INTRODUCTION
Depth estimation is a fundamental and ill-posed problem in computer vision. It attracts great interest in many areas, from scene reconstruction [1] to augmented reality (AR) [2]. It has
long been a key point of research; however, most traditional methods require multiple viewpoints. Supervised learning of large deep convolutional neural networks (CNNs) overcame this issue [3, 4, 5]. In addition, CNNs are used in combination with passive sensors (cameras), which are usually cheaper and lighter than their active counterparts such as LIDAR. However, supervised learning requires large amounts of labeled data for training. Such data is available to some degree for certain specific scenarios, such as interior scenes covered by the NYUv2 dataset [6]. Other scene types, however, are covered only partially or not at all by the available datasets. To allow for more versatile training across various settings, unsupervised approaches were developed in recent years, the two most prominent being Godard et al. [7] and Zhou et al. [8].

* Corresponding author: Shan Du (shan.du@ubc.ca). This work was supported by the University of British Columbia Okanagan [GR017752], Lakehead University [11-50-16112406], and Vector Institute.
Both of these approaches require two or more images during
training but not during inference. Building on these works, many new and improved training architectures have been developed, both unsupervised [9, 10] and self-supervised [2].
While most recent research, both unsupervised and self-supervised, has revolved around large, heavy networks, lightweight approaches such as MiniNet [10] and PyDNet [11] do exist. Both MiniNet [10] and MonoDepthV2 [2] utilize secondary networks during training to boost their performance. In both cases, these networks provide pose estimation between the images. The secondary networks are large and heavy, and although they are not utilized during inference, they drastically increase the complexity during training. We propose a novel self-supervised lightweight training framework which reduces the training complexity while maintaining an increase in performance. The training architecture we propose is specifically targeted at lightweight generator architectures.
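For rectified stereo pairs, a photometric reprojection loss can be computed without any pose network: the predicted disparity alone determines where each left-image pixel appears in the right image, so the left view can be reconstructed by horizontal sampling. The following NumPy sketch is illustrative only; the function names and the plain L1 penalty are our assumptions, not the paper's exact formulation (which is defined later):

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at
    x - d(x, y) along each row, with linear interpolation and
    border clamping (a common choice for rectified stereo)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disparity          # horizontal sample positions
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    frac = np.clip(xs - np.floor(xs), 0.0, 1.0)     # interpolation weight
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_l1(left, right, disparity):
    """Mean absolute photometric reprojection error: compare the left
    image against its reconstruction from the right image."""
    return float(np.mean(np.abs(left - warp_right_to_left(right, disparity))))
```

A correct disparity map drives this loss towards zero, which is why no separate pose-estimation network is needed for a calibrated stereo pair; published methods typically combine such an L1 term with an SSIM term, which is omitted here for brevity.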
Our novel self-supervised framework, shown in Figure 1, boosts the generator's performance while maintaining a low complexity during training. Similar to other approaches, our proposed framework utilizes two input images at training time to calculate a disparity map ground truth approximation, which is compared to the disparity prediction made by the generator. In this way, it is able to boost a lightweight target network's performance. While our novel approach does
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-6654-0540-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICASSP43922.2022.9747826