3D Object Detection and 6D Pose Estimation Using
RGB-D Images and Mask R-CNN
Van Luan Tran
Department of Electrical Engineering
National Chung Cheng University
Chiayi 621, Taiwan
tranvanluan07118@gmail.com
Huei-Yung Lin
Department of Electrical Engineering
National Chung Cheng University
Chiayi 621, Taiwan
lin@ee.ccu.edu.tw
Abstract—Understanding 3D scenes has attracted significant
interest in recent years. Specifically, visual sensors are used
to provide the information a robotic manipulator needs to
interact with the target object. Thus, 6D pose estimation
and object recognition from point clouds or RGB-D images are
important tasks for visual servoing. In this paper, we propose
a learning based approach to perform 6D pose estimation for
robotic manipulation using Mask R-CNN and the structured
light technique. The proposed approach optimizes the 6D pose
between the target objects and their 3D CAD models in multiple
layers. Our method is evaluated on a publicly available dataset for
6D pose estimation and shows its efficiency in computation time.
The experimental results demonstrate the feasibility of the method
for random bin picking applications.
Index Terms—3D pose estimation, RGB-D image, structured
light system, object detection
I. INTRODUCTION
In recent years, deep learning has shown its effectiveness
for robot vision, especially with object detection and scene
understanding. On the other hand, with the rapid progress
in warehouse automation technologies, 6D pose estimation
is an important task for a robotic manipulator to determine
the exact position of the target object [10]. For instance,
recognizing the 3D position and orientation of the objects in
a scene is essential for a robotic manipulator. It is also useful
in robot vision interaction tasks such as learning from real
environments. Nevertheless, the problem is challenging due
to the changing position of objects in the real environment.
Objects may appear to have different 3D shapes due to lighting
conditions and distortion, and their appearance in RGB-D
images is affected by occlusion between different objects.
Traditionally, 6D pose estimation of an object is performed
by matching feature points between the 3D model and the 3D
scene [4]. However, these methods require the objects to have
plenty of detectable features for matching. As a result, they
perform poorly on feature-poor or textureless objects.
Currently, with the development of deep learning,
the state-of-the-art 3D recognition systems usually adopt two
strategies: (1) Recognize the object in a scene with RGB
images and project the 3D CAD model of the object to
determine its 6D pose [14], [23]. (2) Use 3D deep learning
methods for detection, segmentation, and classification to
understand the 3D scene with RGB-D images and point clouds
[15], [20].
Recently, 3D deep learning has become very popular for
object detection and pose estimation by understanding the 3D
scene. Specifically, Zeng et al. proposed a multi-view self-
supervised deep learning technique for 6D pose estimation
in the Amazon Picking Challenge [25]. They segmented and
labeled multiple views of a scene with a fully convolutional
neural network, and then fitted pre-scanned 3D object models
to the resulting segmentation to derive the 6D object pose.
Training a deep neural network for segmentation typically
requires a large amount of training data. Typical computer
vision algorithms operate on single images for segmentation
and recognition. Robotic arms free us from that constraint
by allowing multiple views to be precisely fused, improving
the performance in cluttered environments. However, deep
learning approaches to 6D object pose estimation still require
significant improvement in both simplicity and accuracy.
In this paper, we propose an efficient technique to perform
6D pose estimation using the framework summarized in Fig.
1. It is based on Mask R-CNN with RGB images and depth
information as input. Mask R-CNN is a deep neural network
trained to solve the instance segmentation problem in computer
vision [6]. The network detects and separates different objects
in an image while concurrently producing a high-quality
segmentation mask for each object. With the mask of each
object, we accumulate the objects in multiple layers and
then apply 3D triangulation from a structured light system
to calculate the 3D point cloud of each object. For 6D pose
estimation in multiple layers, we use a 3D CAD model to
optimize the object pose with the ICP algorithm, which finds
the best transformation minimizing the distance between the
multi-layer source point clouds and the 3D CAD model.
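The alignment step can be sketched as a minimal point-to-point ICP with brute-force nearest-neighbor correspondences. The following NumPy sketch is illustrative only: the function names and parameters are assumptions, not the multi-layer implementation described in this paper.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst (Kabsch/SVD)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:      # correct a reflection if one appears
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(source, target, iters=50, tol=1e-8):
    """Align the source cloud to the target cloud; returns (R, t, aligned)."""
    src = source.copy()
    R_total, t_total = np.eye(3), np.zeros(3)
    prev_err = np.inf
    for _ in range(iters):
        # brute-force nearest-neighbor correspondences
        d = np.linalg.norm(src[:, None, :] - target[None, :, :], axis=2)
        idx = d.argmin(axis=1)
        R, t = best_fit_transform(src, target[idx])
        src = src @ R.T + t
        # accumulate the total transform: x' = R (R_total x + t_total) + t
        R_total, t_total = R @ R_total, R @ t_total + t
        err = d[np.arange(len(src)), idx].mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total, src
```

In practice, the target would be points sampled from the 3D CAD model and the source the multi-layer point cloud obtained from structured light triangulation; a k-d tree would replace the brute-force search for large clouds.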
II. RELATED WORK
The main research topics of robot vision include accurate
indoor object positioning systems for robotic manipulators
(positioning the object), sensor-based safety systems, the
interaction between humans and robots (machine vision), higher
levels of realism in robot vision systems (3D segmentation,
978-1-7281-6932-3/20/$31.00 ©2020 IEEE