3D Object Detection and 6D Pose Estimation Using
RGB-D Images and Mask R-CNN
Van Luan Tran
Department of Electrical Engineering
National Chung Cheng University
Chiayi 621, Taiwan
tranvanluan07118@gmail.com
Huei-Yung Lin
Department of Electrical Engineering
National Chung Cheng University
Chiayi 621, Taiwan
lin@ee.ccu.edu.tw
Abstract—Understanding 3D scenes has attracted significant
interest in recent years. Specifically, visual sensors are used
to provide the information a robotic manipulator needs to
interact with the target object. Thus, 6D pose estimation
and object recognition from point clouds or RGB-D images are
important tasks for visual servoing. In this paper, we propose
a learning based approach to perform 6D pose estimation for
robotic manipulation using Mask R-CNN and the structured
light technique. The proposed approach optimizes the 6D pose
between the target objects and their 3D CAD models in multiple
layers. Our method is evaluated on a publicly available dataset for
6D pose estimation and shows its efficiency in computation time.
The experimental results demonstrate the feasibility of the method
for random bin picking applications.
Index Terms—3D pose estimation, RGB-D image, structured
light system, object detection
I. INTRODUCTION
In recent years, deep learning has shown its effectiveness
for robot vision, especially with object detection and scene
understanding. On the other hand, with the rapid progress
in warehouse automation technologies, 6D pose estimation
is an important task for a robotic manipulator to determine
the exact position of the target object [10]. For instance,
recognizing the 3D position and orientation of the objects in
a scene is essential for a robotic manipulator. It is also useful
in robot vision interaction tasks such as learning from real
environments. Nevertheless, the problem is challenging due
to the changing position of objects in the real environment.
Objects may appear to have different 3D shapes due to lighting
conditions and distortion, and their appearance in RGB-D
images is affected by occlusion between different objects.
Traditionally, 6D pose estimation of an object is performed
by matching feature points between the 3D model and the 3D
scene [4]. However, these methods require the objects to have
plenty of detectable features for matching. As a result, they
perform poorly on feature-poor or textureless objects.
Currently, with the development of deep learning,
the state-of-the-art 3D recognition systems usually adopt two
strategies: (1) Recognize the object in a scene with RGB
images and project the 3D CAD model of the object to
determine its 6D pose [14], [23]. (2) Use 3D deep learning
methods for detection, segmentation, and classification to
understand the 3D scene with RGB-D images and point clouds
[15], [20].
Recently, 3D deep learning has become very popular for
object detection and pose estimation by understanding the 3D
scene. Specifically, Zeng et al. proposed a multi-view self-
supervised deep learning technique for 6D pose estimation
in the Amazon Picking Challenge [25]. They segmented and
labeled multiple views of a scene with a fully convolutional
neural network, and then fitted pre-scanned 3D object models
to the resulting segmentation to derive the 6D object pose.
Training a deep neural network for segmentation typically
requires a large amount of training data. Typical computer
vision algorithms operate on single images for segmentation
and recognition. Robotic arms free us from that constraint
by allowing multiple views to be precisely fused, improving
the performance in cluttered environments. However, deep
learning approaches to 6D object pose estimation still require
significant improvement in both simplicity and accuracy.
In this paper, we propose an efficient technique to perform
6D pose estimation using the framework summarized in Fig.
1. It is based on Mask R-CNN with RGB images and depth
information as input. Mask R-CNN is a deep neural network
trained to solve the instance segmentation problem in computer
vision [6]. The network detects and separates different objects
in an image while concurrently producing a high-quality
segmentation mask for each object. With the mask of each
object, we accumulate the objects in multiple layers and
then apply 3D triangulation from a structured light system
to calculate the 3D point cloud of each object. For 6D pose
estimation in multiple layers, we use a 3D CAD model to
optimize the object pose with the ICP algorithm, which finds
the best transformation minimizing the distance between the
multi-layer source point clouds and the 3D CAD model.
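The alignment step can be sketched as a minimal point-to-point ICP with brute-force nearest-neighbor correspondences. The following NumPy sketch is illustrative only: the function names and parameters are assumptions, not the multi-layer implementation described in this paper.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (Kabsch/SVD)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # correct a reflection if one appears
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(source, target, iters=50, tol=1e-8):
    """Align the source cloud to the target cloud; returns (R, t, aligned)."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        # brute-force nearest-neighbor correspondences
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        R, t = best_fit_transform(src, target[idx])
        src = src @ R.T + t
        # accumulate the total transform: x' = R (R_total x + t_total) + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = d[np.arange(len(src)), idx].mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total, src
```

In practice, the target would be points sampled from the 3D CAD model and the source the multi-layer point cloud obtained from structured light triangulation; a k-d tree would replace the brute-force search for large clouds.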
II. RELATED WORK
The main research topics of robot vision include accurate
indoor object positioning systems for robotic manipulators
(positioning the object), sensor-based safety systems, the
interaction between humans and robots (machine vision), higher
levels of realism in robot vision systems (3D segmentation,
978-1-7281-6932-3/20/$31.00 ©2020 IEEE