Unsupervised Depth and Confidence Prediction from Monocular Images using Bayesian Inference

Vishal Bhutani 1, Madhu Vankadari 1, Omprakash Jha 1, Anima Majumder 1, Swagat Kumar 2 and Samrat Dutta 1

Abstract—In this paper, we propose an unsupervised deep learning framework with Bayesian inference for improving the accuracy of per-pixel depth prediction from monocular RGB images. The proposed framework predicts a confidence map along with depth and pose information for a given input image. The depth hypotheses from previous frames are propagated forward and fused with the depth hypothesis of the current frame using a Bayesian inference mechanism. The ground-truth information required for training the confidence map prediction is constructed using the image reconstruction loss, thereby obviating the need for the explicit ground-truth depth information used in supervised methods. The resulting unsupervised framework is shown to outperform the existing state-of-the-art methods for depth prediction on the publicly available KITTI outdoor dataset. The usefulness of the proposed framework is further established by demonstrating a real-world robotic pick-and-place application where the pose of the robot end-effector is computed using the depth predicted from an eye-in-hand monocular camera. The design choices made for the proposed framework are justified through extensive ablation studies.

I. INTRODUCTION

Depth estimation from RGB images is an active field of research, finding application in a wide range of areas such as augmented reality [1], 3D graphics [2], [3] and robotics [4]. Deep learning based methods have been shown to outperform traditional methods that use hand-crafted features and exploit camera geometry and/or motion to estimate depth.
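As background for the unsupervised training signal discussed throughout the paper, the image reconstruction (photometric) error that replaces ground-truth depth supervision can be sketched as follows. This is a minimal illustration, not the paper's exact loss: the function names are ours, and we assume the source view has already been warped into the target view using the predicted depth and pose.

```python
import numpy as np

def photometric_error(target, recon):
    """Per-pixel photometric error: mean absolute difference over
    colour channels between the target image and the image
    reconstructed (warped) from the predicted depth and pose."""
    diff = np.abs(target.astype(np.float64) - recon.astype(np.float64))
    return diff.mean(axis=-1)  # shape (H, W)

def reconstruction_loss(target, recon):
    """Scalar training loss: average of the per-pixel photometric error."""
    return photometric_error(target, recon).mean()
```

A low per-pixel error indicates the predicted depth explains the view change well at that pixel; this is the same signal the paper later uses as a proxy supervision target for the confidence map.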
These learning-based methods can be broadly classified into two categories - supervised or unsupervised - depending on whether or not they require explicit ground-truth depth information obtained from range sensors such as LiDAR. Since the availability of explicit ground truth poses a constraint that cannot be met in many real-world situations, there has been growing interest in unsupervised learning methods that aim to overcome this limitation. These methods exploit the temporal and/or spatial consistencies present in the images to extract structural and motion information in the absence of ground-truth depth data [5], [6], [7], [8]. The constraints for spatial consistency are derived from stereo or multi-view images, while the constraints for temporal consistency are obtained from a sequence of images. Monocular methods [6], [9], [10] that rely only on temporal consistency (optical-flow motion) are shown to be inferior to stereo methods [7], [11] that additionally incorporate spatial consistency into their learning process. While the depth prediction accuracies have been increasing over the years, there is still ample scope for improvement, as the current depth predictions are still not close to what is available from range sensors.

Fig. 1: A visual demonstration of depth estimation results applied to a few RGB monocular images. (a) shows the input images to the network - the first two are randomly selected from the KITTI outdoor dataset and the last one is taken from our own indoor dataset; (b) shows the per-pixel depth predicted by our network; (c) shows the confidence maps predicted by the network; and (d) shows the image reconstruction error (darker pixels indicate lower error). The lighter regions in the confidence maps signify high confidence in the predicted depth and darker regions represent low confidence. One such example can be observed in the highlighted rectangular region: due to the reflective window of the car in the RGB image, the network is unable to correctly predict the disparity, resulting in a low confidence value.

It has been demonstrated recently that having a measure of model uncertainty (or confidence) can greatly influence the decision-making process [12], [13]. The depth hypotheses from previous frames can be propagated and combined with the depth hypothesis for the current frame according to their respective uncertainty maps to smooth out abrupt errors, thereby improving the accuracy of depth prediction for the current frame [12]. These confidence maps (the inverse of model uncertainty) are predicted along with depth and pose information from RGB images using deep networks. The loss function needed to train the confidence map prediction usually requires ground-truth depth information as the supervision signal, which may not be available in many real-world situations. This constrains the applicability of the current approach and hence provides the motivation for our work. Instead of using ground-truth depth data, we use reconstructed images (constructed using predicted depth maps) to compute the loss function required for training the confidence map prediction. The intuition for this comes from the fact that the quality of image reconstruction usually

The authors are associated with TCS Research, 1 TATA Consultancy Services, India and 2 Edge Hill University, UK. Email ID: {vk.bhutani, madhu.vankadari, omprakash.jha, anima.majumder and d.samrat}@tcs.com, swagat.kumar@edgehill.ac.uk

2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), October 25-29, 2020, Las Vegas, NV, USA (Virtual)
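The uncertainty-weighted combination of depth hypotheses described above can be sketched as inverse-variance fusion of per-pixel Gaussian hypotheses. This is an illustrative sketch under that Gaussian assumption, not the paper's exact formulation; the function name is ours.

```python
import numpy as np

def fuse_depth(d_prev, var_prev, d_curr, var_curr):
    """Fuse two per-pixel depth hypotheses modelled as Gaussians
    using inverse-variance (precision) weighting. The fused estimate
    leans toward the more confident (lower-variance) hypothesis, and
    the fused variance is never larger than either input variance."""
    precision = 1.0 / var_prev + 1.0 / var_curr
    var_fused = 1.0 / precision
    d_fused = var_fused * (d_prev / var_prev + d_curr / var_curr)
    return d_fused, var_fused
```

For equally confident hypotheses the result is their midpoint; when one variance is larger (lower confidence, e.g. the reflective car window in Fig. 1), its hypothesis contributes proportionally less, which is how propagated hypotheses can smooth out abrupt errors in the current frame.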