A LIGHTWEIGHT SELF-SUPERVISED TRAINING FRAMEWORK FOR MONOCULAR DEPTH ESTIMATION

Tim Heydrich, Yimin Yang
Lakehead University, Department of Computer Science, Thunder Bay, ON, Canada

Shan Du *
The University of British Columbia Okanagan, Department of Computer Science, Mathematics, Physics and Statistics, Kelowna, BC, Canada

ABSTRACT

Depth estimation attracts great interest in various sectors such as robotics, human-computer interfaces, intelligent visual surveillance, and wearable augmented reality gear. Monocular depth estimation is of particular interest due to its low complexity and cost. Research in recent years has shifted away from supervised learning towards unsupervised or self-supervised approaches. While these have achieved strong results, most of the work has focused on large, heavy networks that are highly resource intensive, making them unsuitable for systems with limited resources. We are particularly concerned with the increased training complexity that current self-supervised approaches bring. In this paper, we propose a lightweight self-supervised training framework that utilizes computationally cheap methods to compute ground-truth approximations. In particular, we utilize a stereo pair of images during training to compute a photometric reprojection loss and a disparity ground-truth approximation. Due to this ground-truth approximation, our framework removes the need for pose estimation and the corresponding heavy prediction networks that current self-supervised methods require. In the experiments, we demonstrate that our framework is capable of increasing the generator's performance at a fraction of the size required by the current state-of-the-art self-supervised approach.

Index Terms— computer vision, depth estimation, deep learning, self-supervised

1. INTRODUCTION

Depth estimation is a fundamental and ill-posed problem in computer vision.
There is great interest in many areas, from scene reconstruction [1] to augmented reality (AR) [2]. It has long been a key point of research; however, most traditional methods require multiple viewpoints. Supervised learning of large deep convolutional neural networks (CNNs) overcame this issue [3, 4, 5]. In addition, CNNs are used in combination with passive sensors (cameras), which are usually cheaper and lighter than active counterparts like LIDAR. However, supervised learning has the problem of requiring large amounts of labeled data for training. This data is available to some degree for certain specific scenarios, such as interior scenes with the NYUv2 dataset [6], but many other scenes are covered only partially or not at all in the available datasets. In order to allow for more versatile training across various settings, unsupervised approaches were developed in recent years, the two most prominent being Godard et al. [7] and Zhou et al. [8]. Both of these approaches require two or more images during training but not during inference. Building on these works, many new and improved training architectures were developed, both unsupervised [9, 10] and self-supervised [2]. While most recent research has revolved around big, heavy networks, both unsupervised and self-supervised, there are lightweight approaches such as MiniNet [10] and PyDNet [11]. Both MiniNet [10] and MonoDepthV2 [2] utilize secondary networks during training to boost their performance; in both cases, these networks provide pose estimation between the images. These secondary networks are large, heavy networks that are not utilized during inference.

* Corresponding author: Shan Du (shan.du@ubc.ca). This work was supported by the University of British Columbia Okanagan [GR017752], Lakehead University [11-50-16112406], and Vector Institute.
However, they drastically increase the complexity during training. We propose a novel self-supervised lightweight training framework that reduces training complexity while still delivering an increase in performance. The training architecture we propose is specifically targeted at lightweight generator architectures.

Our novel self-supervised framework, shown in Figure 1, is able to boost the generator's performance while maintaining a low complexity during training. Similar to other approaches, our proposed framework utilizes two input images at training time to calculate a disparity-map ground-truth approximation and compare it to the disparity prediction made by the generator. Furthermore, it is able to boost a lightweight target network's performance. While our novel approach does

978-1-6654-0540-9/22/$31.00 ©2022 IEEE | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | DOI: 10.1109/ICASSP43922.2022.9747826
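To make the idea concrete, the following is a minimal NumPy sketch, not the authors' implementation, of the two ingredients named above: a computationally cheap disparity ground-truth approximation obtained by naive block matching on a rectified stereo pair, an L1 loss comparing a predicted disparity map to that approximation, and a photometric reprojection that warps the right image into the left view. All function names and parameters here are our own illustrative choices.

```python
import numpy as np

def block_matching_disparity(left, right, max_disp=16, block=5):
    """Naive block matching on rectified grayscale images: for each left-image
    pixel, search horizontal shifts d in the right image and keep the shift
    with minimum sum-of-absolute-differences (SAD) over a small window.
    Returns a coarse disparity map usable as a ground-truth approximation."""
    h, w = left.shape
    half = block // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            best_d, best_cost = 0, np.inf
            for d in range(min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                cost = np.abs(patch - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp

def l1_disparity_loss(pred, approx):
    """L1 loss between the generator's predicted disparity and the
    block-matching ground-truth approximation."""
    return float(np.abs(pred - approx).mean())

def reproject_right_to_left(right, disp):
    """Warp the right image into the left view with nearest-neighbour
    sampling: left_hat[y, x] = right[y, x - disp[y, x]]. A photometric
    reprojection loss would then compare left_hat against the left image."""
    h, w = right.shape
    xs = np.tile(np.arange(w), (h, 1))
    src = np.clip(xs - disp.astype(int), 0, w - 1)
    return np.take_along_axis(right, src, axis=1)
```

In practice a real pipeline would use a faster matcher (e.g. semi-global matching) and bilinear sampling for the warp, but the loop above captures why this supervision is cheap: it needs no learned pose network, only a fixed search along the epipolar line.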