Unsupervised Depth and Confidence Prediction from Monocular Images
using Bayesian Inference
Vishal Bhutani¹, Madhu Vankadari¹, Omprakash Jha¹, Anima Majumder¹, Swagat Kumar² and Samrat Dutta¹
Abstract—In this paper, we propose an unsupervised deep
learning framework with Bayesian inference for improving
the accuracy of per-pixel depth prediction from monocular
RGB images. The proposed framework predicts a confidence map
along with depth and pose information for a given input image.
The depth hypotheses from previous frames are propagated
forward and fused with the depth hypothesis of the current frame
using a Bayesian inference mechanism. The ground-truth
information required for training the confidence map prediction
is constructed using an image reconstruction loss, thereby obviating
the need for the explicit ground-truth depth information used
in supervised methods. The resulting unsupervised framework
is shown to outperform existing state-of-the-art methods
for depth prediction on the publicly available KITTI outdoor
dataset. The usefulness of the proposed framework is further
established by demonstrating a real-world robotic pick-and-place
application, where the pose of the robot end-effector
is computed using the depth predicted from an eye-in-hand
monocular camera. The design choices made for the proposed
framework are justified through extensive ablation studies.
I. INTRODUCTION
Depth estimation from RGB images is an active field of
research, finding application in a wide range of areas such as
augmented reality [1], 3D graphics [2], [3] and robotics [4].
Deep learning based methods have been shown to outperform
traditional methods that use hand-crafted features and
exploit camera geometry and/or motion to estimate depth.
These learning based methods can be broadly classified
into two categories, supervised and unsupervised, depending
on whether or not they require explicit ground-truth depth
information obtained from range sensors such as LiDAR. Since
the availability of explicit ground truth poses a constraint
that cannot be met in many real-world situations, there
has been growing interest in unsupervised learning methods
that aim to overcome this limitation. These methods
exploit the temporal and/or spatial consistencies present in
the images to extract structural and motion information in
the absence of ground-truth depth data [5], [6], [7], [8]. The
constraints for spatial consistency are derived from stereo
or multi-view images, while the constraints for temporal
consistency are obtained from a sequence of images.
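As a rough illustration of how temporal consistency can supply a training signal without ground-truth depth (this is our own simplification, not the paper's exact formulation; all names below are hypothetical), the sketch reprojects target pixels into a source view using a predicted depth map, camera intrinsics, and a relative pose. A differentiable bilinear sampler would then reconstruct the target image from these coordinates, and the photometric difference serves as the loss:

```python
import numpy as np

def reproject(depth, K, T):
    """Compute, for every target pixel, its sampling location in a source
    view, given per-pixel depth, intrinsics K (3x3), and relative camera
    pose T (4x4). Returns a (2, H, W) array of (u, v) source coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Homogeneous pixel coordinates, shape (3, H*W)
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1)
    # Back-project to 3D camera coordinates using predicted depth
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    # Transform into the source camera frame and project with K
    src = K @ (T @ cam_h)[:3]
    return (src[:2] / src[2]).reshape(2, h, w)
```

With an identity pose the sampling grid coincides with the original pixel grid; any camera motion shifts it, so the reconstruction error penalizes inconsistent depth and pose predictions jointly.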
Monocular methods [6], [9], [10] that rely only on temporal
consistency (optical-flow motion) are shown to be inferior to
stereo methods [7], [11] that additionally incorporate spatial
consistency into their learning process.

The authors are associated with TCS Research, ¹Tata Consultancy Services, India
and ²Edge Hill University, UK. Email: {vk.bhutani, madhu.vankadari,
omprakash.jha, anima.majumder, d.samrat}@tcs.com, swagat.kumar@edgehill.ac.uk

Fig. 1: A visual demonstration of depth estimation results on a
few monocular RGB images. (a) Input images to the network: the first
two are randomly selected from the KITTI outdoor dataset and the last is
taken from our own indoor dataset. (b) Per-pixel depth predicted by
our network. (c) Confidence maps predicted by the network.
(d) Image reconstruction error (darker pixels indicate lower
error). Lighter regions in the confidence map signify high confidence
in the predicted depth, and darker regions represent low confidence. One such
example can be observed in the highlighted rectangular region: due to
the reflective window of the car in the RGB image, the network cannot
correctly predict disparity, resulting in a low confidence value.

While depth prediction accuracies have been improving over the years,
there is still ample scope for improvement, as current
depth predictions are not yet close to what is available from
range sensors.
It has been demonstrated recently that having a measure of
model uncertainty (or confidence) can greatly influence the
decision-making process [12], [13]. The depth hypotheses
from previous frames can be propagated and combined
with the depth hypothesis for the current frame according
to their respective uncertainty maps to smooth out abrupt
errors, thereby improving the accuracy of depth prediction
for the current frame [12]. These confidence maps (the inverse
of model uncertainty) are predicted along with depth and
pose information from RGB images using deep networks.
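The uncertainty-weighted combination described above can be made concrete with a minimal sketch (the names and the Gaussian assumption are ours, not the paper's exact update): if each per-pixel depth hypothesis is modelled as a Gaussian whose variance comes from the predicted uncertainty, Bayesian fusion of the propagated and current hypotheses reduces to a precision-weighted average:

```python
import numpy as np

def fuse_depth(d_prev, var_prev, d_curr, var_curr):
    """Fuse a depth hypothesis propagated from the previous frame with the
    current prediction, weighting each by its inverse variance (precision).

    For Gaussian hypotheses, the posterior mean is the precision-weighted
    average and the posterior variance is the inverse of the summed
    precisions, so confident (low-variance) hypotheses dominate the fusion.
    """
    precision = 1.0 / var_prev + 1.0 / var_curr
    fused_var = 1.0 / precision
    fused_depth = fused_var * (d_prev / var_prev + d_curr / var_curr)
    return fused_depth, fused_var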
The loss function necessary for training the confidence map
prediction usually requires ground-truth depth information
as the supervision signal, which might not be available in
many real-world situations. This constrains the applicability
of the current approach and hence provides the motivation for
our work. Instead of using ground-truth depth data, we
use reconstructed images (constructed using predicted depth
maps) to compute the loss function required for training
the confidence map prediction. The intuition for this comes
from the fact that the quality of image reconstruction usually
2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
October 25-29, 2020, Las Vegas, NV, USA (Virtual)
978-1-7281-6211-9/20/$31.00 ©2020 IEEE