Single-Shot 3D Detection of Vehicles from Monocular RGB Images via Geometry Constrained Keypoints in Real-Time

Nils Gählert 1, Jun-Jun Wan 2, Nicolas Jourdan 3, Jan Finkbeiner 4, Uwe Franke 5 and Joachim Denzler 6

© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract— In this paper we propose a novel 3D single-shot object detection method for detecting vehicles in monocular RGB images. Our approach lifts 2D detections to 3D space by predicting additional regression and classification parameters, keeping the runtime close to that of pure 2D object detection. The additional parameters are transformed to 3D bounding box keypoints within the network under geometric constraints. Our proposed method provides a full 3D description including all three angles of rotation without supervision by any labeled ground truth data for the object's orientation, as it focuses on certain keypoints within the image plane. While our approach can be combined with any modern object detection framework with only little computational overhead, we exemplify the extension of SSD for the prediction of 3D bounding boxes. We test our approach on different datasets for autonomous driving and evaluate it on the challenging KITTI 3D Object Detection benchmark as well as the novel nuScenes Object Detection benchmark. While we achieve competitive results on both benchmarks, we outperform current state-of-the-art methods in terms of speed, running at more than 20 FPS for all tested datasets and image resolutions.

I. INTRODUCTION

Object detection – both in 2D as well as in 3D – is a key enabler for autonomous driving systems.
To this end, autonomous vehicles currently in development, as well as consumer cars that provide advanced driver assistance systems, are equipped with a set of sensors such as RGB cameras, LiDARs and radar systems. While the accurate distance measurements of LiDAR sensors enable robust 3D bounding box detection, their high cost may prohibit their use in series production vehicles. 3D object detection from monocular RGB cameras has thus become a focus of recent computer vision research. In contrast to LiDAR measurements, RGB images provide rich semantic information that can be used to boost object classification. One of the most challenging problems in 3D object detection from monocular RGB images is the missing depth information: a neural network needs to accurately estimate depth from a single image. Furthermore, real-time performance is required to enable the use of an algorithm in an autonomous vehicle.

In this paper we present 3D-GCK, a novel technique to detect vehicles as 3D bounding boxes from monocular RGB images by transforming a set of predicted regression and classification parameters to geometrically constrained 3D keypoints. In contrast to other 3D bounding box estimators, 3D-GCK is capable of predicting all three angles of rotation (θ, ψ, φ), which are required for a full description of a 3D bounding box. 3D-GCK focuses only on keypoints in the image plane and exploits the projection properties to generate 3D orientation information.

Fig. 1. Exemplary result of 3D-GCK for an image taken from the nuScenes test set [1].

1 Mercedes-Benz AG, University of Jena, nils.gaehlert@daimler.com
2 Robert Bosch GmbH, kuanih.junjun.wan@gmail.com
3 Mercedes-Benz AG & TU Darmstadt, n.jourdan@ptw.tu-darmstadt.de
4 Mercedes-Benz AG, jan.finkbeiner@daimler.com
5 Mercedes-Benz AG, uwe.franke@daimler.com
6 Computer Vision Group, University of Jena, joachim.denzler@uni-jena.de
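The link between image-plane keypoints and 3D orientation can be illustrated with a plain pinhole model. The following sketch is purely illustrative and is not the paper's implementation; the coordinate convention, box parameterization and the KITTI-like intrinsics are assumptions. It shows that projecting the corners of a 3D box with the camera intrinsics K yields an image keypoint pattern that changes with the box's yaw, which is why keypoints in the image plane alone carry orientation information.

```python
import numpy as np

def box_corners(center, dims, yaw):
    """8 corners of a box in camera coordinates (x right, y down, z forward),
    rotated by `yaw` around the vertical (y) axis. Bottom face lies at y = 0
    of the box frame; the vehicle extends upward (negative y)."""
    l, w, h = dims
    xs = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2.0
    ys = np.array([0, 0, 0, 0, -1, -1, -1, -1]) * h
    zs = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2.0
    R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(yaw), 0.0, np.cos(yaw)]])
    return (R @ np.stack([xs, ys, zs])).T + np.asarray(center)

def project(K, pts3d):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel keypoints."""
    uvw = (K @ pts3d.T).T
    return uvw[:, :2] / uvw[:, 2:3]

K = np.array([[721.5, 0.0, 609.6],   # illustrative KITTI-like intrinsics
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
center, dims = (2.0, 1.6, 15.0), (4.2, 1.8, 1.5)  # box 15 m ahead of the camera
kp_a = project(K, box_corners(center, dims, yaw=0.0))
kp_b = project(K, box_corners(center, dims, yaw=0.5))
# Same box, different yaw: the projected keypoint pattern differs, so the
# network can recover orientation from keypoint positions alone.
```

Conversely, a network that regresses such keypoints fixes the orientation implicitly, which is why no rotation labels are needed during training.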
Hence, no labeled ground truth for the angles of rotation is required to train 3D-GCK, which facilitates the collection of training data.

We use a standard single-shot 2D object detection framework – in our case SSD [2] – and add the proposed extension to lift the predicted 2D bounding boxes from image space to 3D bounding boxes. Lifting 2D bounding boxes to 3D space can be done with minimal computational overhead, leading to real-time capable performance.

We summarize our contributions as follows: 1) We introduce 3D-GCK, which can be used with all current state-of-the-art 2D object detection frameworks such as SSD [2], Yolo [3] and Faster-RCNN [4] to detect vehicles and lift their 2D bounding boxes to 3D space. 2) We extend SSD with the proposed 3D-GCK architecture as an example to accentuate the practical use of 3D-GCK. 3) We evaluate 3D-GCK on 4 challenging and diverse datasets especially tailored for autonomous driving: KITTI, nuScenes, A2D2 and Synscapes. We achieve competitive results on the publicly available KITTI 3D Object Detection and nuScenes Object Detection benchmarks. At the same time, 3D-GCK is the fastest 3D object detection framework that relies exclusively on monocular RGB images.

arXiv:2006.13084v1 [cs.CV] 23 Jun 2020
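The "minimal computational overhead" of lifting can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's head design: the number of extra 3D parameters P per anchor is an assumption, as are the function and variable names. It shows that an SSD-style head output only needs to grow from A·(C+4) to A·(C+4+P) channels per feature-map cell; the backbone and the rest of the detector are unchanged.

```python
import numpy as np

def split_head_output(raw, num_anchors, num_classes, num_3d_params):
    """Split a (H, W, A*(C+4+P)) head output into per-anchor tensors:
    C class scores, 4 offsets for the 2D box, and P extra values that a
    3D-GCK-style extension would use to lift each box to 3D."""
    H, W, _ = raw.shape
    per_anchor = num_classes + 4 + num_3d_params
    raw = raw.reshape(H, W, num_anchors, per_anchor)
    cls = raw[..., :num_classes]
    box2d = raw[..., num_classes:num_classes + 4]
    params3d = raw[..., num_classes + 4:]
    return cls, box2d, params3d

A, C, P = 6, 2, 8  # anchors per cell, classes, extra 3D params (illustrative)
raw = np.random.randn(38, 38, A * (C + 4 + P))  # one SSD feature map
cls, box2d, params3d = split_head_output(raw, A, C, P)
# cls: (38, 38, 6, 2), box2d: (38, 38, 6, 4), params3d: (38, 38, 6, 8)
```

Since the additional channels are produced by the same convolutional head that already predicts class scores and 2D offsets, the forward-pass cost of the 3D extension is a small constant on top of the 2D detector.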