3D Object Pose Estimation using Multi-Objective Quaternion Learning
Christos Papaioannidis and Ioannis Pitas, Fellow, IEEE
Abstract—In this work, a framework is proposed for object recognition and pose estimation from color images using convolutional neural networks (CNNs). 3D object pose estimation along with object recognition has numerous applications, such as robot positioning relative to a target object and robotic object grasping. Previous methods addressing this problem relied on both color and depth (RGB-D) images to learn low-dimensional viewpoint descriptors for object pose retrieval. In the proposed method, a novel quaternion-based multi-objective loss function is used, which combines manifold learning and regression to learn 3D pose descriptors and perform direct 3D object pose estimation, using only color (RGB) images. The 3D object pose can then be obtained either by using the learned descriptors in a Nearest Neighbor (NN) search, or by direct neural network regression. An extensive experimental evaluation shows that such descriptors provide greater pose estimation accuracy compared to state-of-the-art methods. In addition, the learned 3D pose descriptors are almost object-independent and, thus, generalizable to unseen objects. Finally, when the object identity is not of interest, the 3D object pose can be regressed directly from the network, bypassing the NN search and thus significantly reducing the object pose inference time.
Index Terms—3D object pose estimation, convolutional neural networks, multi-objective learning, object recognition, quaternion.
I. INTRODUCTION

Object recognition and 3D pose estimation are very challenging computer vision tasks. They have been heavily researched recently, due to their importance in robotics and augmented reality applications. However, there is still large room for improvement, as occlusion, background clutter, scale and illumination variations highly affect object appearance and, hence, reduce pose estimation accuracy.
3D object pose estimation typically derives the object orientation in a camera coordinate system (O_c, X_c, Y_c, Z_c), e.g., in the form of a quaternion q ∈ R^4. The rotation R ∈ R^{3×3} between the object coordinate system (O_o, X_o, Y_o, Z_o) and the camera coordinate system can be defined by a unit quaternion, as shown in Fig. 1. The 3D object pose estimation problem can be considered as a regression problem, if q is continuous over R^4 [1]–[3], or as a classification problem, if the 3D pose space has been quantized into a predefined number of orientation classes [4]–[6].
Christos Papaioannidis and Ioannis Pitas are with the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece. E-mail: cpapaionn@csd.auth.gr, pitas@aiia.csd.auth.gr.
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731667 (MULTIDRONE). This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.
An alternative approach to 3D object pose estimation is transforming the 3D object pose regression problem into a
nearest neighbor (NN) one, by matching hand-crafted [7] or
extracted [8]–[11] image descriptors with a set of orientation
class templates via NN search. It has to be mentioned that the 3D object pose estimation problem addressed by this work is a sub-case of 6D object pose estimation, where both the rotation R ∈ R^{3×3} and the translation T ∈ R^3 between the object coordinate system and the camera coordinate system are estimated. Also note that, in this paper, we focus on rigid object pose estimation; articulated objects (e.g., the human body) are not considered [12].
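For concreteness, the mapping from a unit quaternion q = (w, x, y, z) ∈ R^4 to the rotation R ∈ R^{3×3} referred to above can be sketched as follows. This is the standard quaternion-to-rotation-matrix formula, not code from the proposed method; NumPy is assumed for illustration:

```python
import numpy as np

def quaternion_to_rotation_matrix(q):
    """Convert a unit quaternion q = (w, x, y, z) to a 3x3 rotation matrix."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)  # enforce unit norm
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

# The identity quaternion yields the identity rotation
R = quaternion_to_rotation_matrix([1.0, 0.0, 0.0, 0.0])
```

Since q and −q produce the same matrix, a unit quaternion is a double cover of the rotation group, which is why quaternion-based pose representations must be handled with care in regression losses.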
Both classification and regression are typical machine learning problems. Deep learning, and especially Convolutional Neural Networks (CNNs) [13], has shown remarkable performance in such computer vision tasks, e.g., object detection [14]–[16], recognition [17], [18] and instance segmentation [19]. Deep CNNs were also successfully used for 6D object pose estimation [20]–[24], where both the 3D rotation and 3D translation of the object are estimated. CNNs usually require a huge amount of training data, and there is limited availability of object images annotated with their ground truth 3D pose, due to the inherent difficulty in estimating such a ground truth. However, 3D object models, if available, can be used to create large amounts of synthetic object images along with their ground truth poses for CNN training [8]–[11], [25]. In the proposed method, a lightweight CNN model is trained using both real and synthetic color object images.
Since most pose estimation methods rely on deep network architectures and/or RGB-D data, our goal is to offer a lightweight and reliable RGB-only 3D object pose estimation method, which can be utilized in embedded systems. Inspired by [10], the proposed method utilizes siamese and triplet CNNs to calculate 3D object pose features. By combining manifold learning and regression, the CNN learns to produce pose features from which both the object identity and the 3D pose can be inferred. However, in contrast to [10], the proposed CNN model is forced to learn features whose distance in the feature space is proportional to the corresponding quaternion distance. To this end, a novel quaternion-based multi-objective loss function is proposed, which combines the strengths of both manifold learning and regression. The trained CNN model demonstrates state-of-the-art 3D object pose estimation accuracy along with object classification. In addition, the object identity and 3D pose are estimated in real time, rendering the method suitable for embedded computing in autonomous robotic systems, such as drones. In drone cinematography [26]–[34], 3D target (object) pose estimation is essential for autonomous navigation and visual drone control
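The quaternion distance to which the learned feature distances are made proportional can be made concrete with a small sketch. The exact metric used in the proposed loss is introduced later in the paper; the snippet below shows one common choice, assumed here purely for illustration, that respects the quaternion double cover (q and −q encode the same rotation):

```python
import numpy as np

def quaternion_distance(q1, q2):
    """Rotation-aware distance between two unit quaternions.

    Because q and -q represent the same 3D rotation, the smaller
    of the two chord distances ||q1 - q2|| and ||q1 + q2|| is used.
    """
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    q1 = q1 / np.linalg.norm(q1)
    q2 = q2 / np.linalg.norm(q2)
    return min(np.linalg.norm(q1 - q2), np.linalg.norm(q1 + q2))

# Identical rotations expressed with opposite signs have distance 0
d = quaternion_distance([1, 0, 0, 0], [-1, 0, 0, 0])
```

A metric of this form is sign-invariant, so a loss built on it does not penalize the network for the arbitrary sign of an otherwise correct quaternion prediction.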
Copyright © 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org. Please cite the publisher maintained version in your work. DOI: 10.1109/TCSVT.2019.2929600