Fully Convolutional Network for Depth Estimation and Semantic Segmentation

Yokila Arora*
ICME, Stanford University
yarora@stanford.edu

Ishan Patil*
Department of Electrical Engineering, Stanford University
iapatil@stanford.edu

Thao Nguyen*
Department of Computer Science, Stanford University
thao2605@stanford.edu

* Equal contribution

Abstract

Scene understanding is an active area of research in computer vision that encompasses several different problems. This paper addresses two of those problems: we use a fully convolutional network architecture to perform, in a single pass, both depth prediction of a scene from a single monocular image and pixel-wise semantic labeling using the same image input together with its depth information. We optimize the first task with L2 and berHu losses, and the latter with a per-pixel negative log-likelihood loss. Our model incorporates residual blocks and efficient up-sampling units to produce high-resolution outputs, thus removing the need for post-processing steps. We achieve reasonable validation accuracies of 49% and 66% on the semantic labeling task when using 38 and 6 classes, respectively.

1. Introduction

Predicting depth is crucial to understanding the physical geometry of a scene. The more challenging problem is learning this geometry from a single monocular image in the absence of any environmental assumptions, because mapping color intensity or illumination to a depth value is inherently ambiguous. Developing an accurate real-time network for pixel-wise depth regression is an ill-posed task, but a crucial one for automated systems where depth sensing is not available. In addition, other computer vision tasks can benefit greatly from depth information, as we show with semantic segmentation in this project. We adapted our model from the one proposed by Laina et al. [10] and implemented a joint architecture in PyTorch for both the depth estimation and semantic segmentation tasks.
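The berHu (reverse Huber) loss used for the depth task behaves like L1 for small residuals and like a scaled L2 for large ones. A minimal PyTorch sketch is shown below; the threshold c = 0.2 · max|residual| follows the formulation of Laina et al. [10], and the function name `berhu_loss` is illustrative rather than taken from our code.

```python
import torch

def berhu_loss(pred, target):
    """Reverse Huber (berHu) loss: L1 below threshold c, scaled L2 above it."""
    diff = torch.abs(pred - target)
    # Threshold set per batch, as in Laina et al.; clamp guards against c = 0.
    c = (0.2 * diff.max()).clamp(min=1e-6)
    l2_branch = (diff ** 2 + c ** 2) / (2 * c)  # quadratic branch for |diff| > c
    return torch.where(diff <= c, diff, l2_branch).mean()
```

The piecewise definition is continuous at |diff| = c, since (c² + c²)/(2c) = c, so gradients do not jump at the threshold.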
The inputs to our model are RGB images from the NYU Depth v2 dataset together with their corresponding ground-truth depth maps, and the outputs are a predicted depth map and semantic labels (for the 6 and 38 most frequent classes in that dataset) for each input image. The model learns directly from data without any scene-specific knowledge. The absence of post-processing steps and fully connected layers reduces both the number of parameters of our model and the number of training examples required, while still ensuring reasonable performance. We also combine the concepts of up-convolution and residual learning to create up-projection units that allow more efficient up-sampling of feature maps, which is essential to increasing the resolution as well as the accuracy of the output image. In this paper, we analyze the influence of different variables (loss function, learning rates, etc.) on the performance of the model, in addition to the results it generates on standard benchmarks.

2. Related Work

Convolutional Neural Networks (CNNs) are widely used for tasks such as image recognition, object classification, and natural language processing. Eigen et al. [3] were the first to use CNNs for depth estimation. They present a multi-scale deep network that first predicts a coarse global output and then refines it with a finer local network. A more recent paper by Eigen et al. [2] extends this model to two other tasks, namely surface normal estimation and semantic labeling, and achieves state-of-the-art results on all three tasks. They develop a general network model using a sequence of three scales, based on AlexNet [9] and the Oxford VGG network [14]. In [4], Farabet et al. propose a multi-scale convolutional network for scene labeling from the raw input images by