3D Object Pose Estimation using Multi-Objective Quaternion Learning

Christos Papaioannidis and Ioannis Pitas, Fellow, IEEE

Abstract—In this work, a framework is proposed for object recognition and pose estimation from color images using convolutional neural networks (CNNs). 3D object pose estimation along with object recognition has numerous applications, such as robot positioning relative to a target object and robotic object grasping. Previous methods addressing this problem relied on both color and depth (RGB-D) images to learn low-dimensional viewpoint descriptors for object pose retrieval. In the proposed method, a novel quaternion-based multi-objective loss function is used, which combines manifold learning and regression to learn 3D pose descriptors and perform direct 3D object pose estimation, using only color (RGB) images. The 3D object pose can then be obtained either by using the learned descriptors in a Nearest Neighbor (NN) search, or by direct neural network regression. An extensive experimental evaluation has proven that such descriptors provide greater pose estimation accuracy compared to state-of-the-art methods. In addition, the learned 3D pose descriptors are almost object-independent and, thus, generalizable to unseen objects. Finally, when the object identity is not of interest, the 3D object pose can be regressed directly from the network, bypassing the NN search and thus significantly reducing the object pose inference time.

Index Terms—3D object pose estimation, convolutional neural networks, multi-objective learning, object recognition, quaternion.

I. INTRODUCTION

Object recognition and 3D pose estimation is a very challenging computer vision task. It has been heavily researched recently, due to its importance in robotics and augmented reality applications.
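The two inference paths summarized above, NN search over learned pose descriptors versus direct regression, can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the descriptor dimensionality, the number of viewpoint templates, and the `nn_pose` helper are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Template set: one learned descriptor plus its ground-truth unit quaternion
# per sampled viewpoint (100 viewpoints, 16-D descriptors; sizes are illustrative).
templates = rng.normal(size=(100, 16))
template_poses = rng.normal(size=(100, 4))
template_poses /= np.linalg.norm(template_poses, axis=1, keepdims=True)

def nn_pose(query_descriptor):
    """Return the pose of the nearest template descriptor (Euclidean NN search)."""
    dists = np.linalg.norm(templates - query_descriptor, axis=1)
    return template_poses[np.argmin(dists)]

# A query descriptor close to template 42 retrieves that template's pose.
query = templates[42] + 0.01 * rng.normal(size=16)
print(nn_pose(query))
```

Direct regression would instead read the quaternion straight from a network output head, skipping the template search entirely, which is why it is the faster option when object identity is not needed.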
However, there is still large room for improvement, as occlusion, background clutter, and scale and illumination variations highly affect object appearance and, hence, reduce pose estimation accuracy.

3D object pose estimation typically derives the object orientation in a camera coordinate system (O_c, X_c, Y_c, Z_c), e.g., in the form of a quaternion q ∈ R^4. The rotation R ∈ R^{3×3} between the object coordinate system (O_o, X_o, Y_o, Z_o) and the camera coordinate system can be defined by a unit quaternion, as shown in Fig. 1. The 3D object pose estimation problem can be considered as a regression problem, if q is continuous over R^4 [1]–[3], or as a classification problem, if the 3D pose space has been quantized into a predefined number of orientation classes [4]–[6]. An alternative approach to 3D object pose estimation is to transform the 3D object pose regression problem into a nearest neighbor (NN) one, by matching hand-crafted [7] or extracted [8]–[11] image descriptors with a set of orientation class templates via NN search. It has to be mentioned that the 3D object pose estimation problem addressed by this work is a sub-case of 6D object pose estimation, where both the rotation R ∈ R^{3×3} and the translation T ∈ R^3 between the object coordinate system and the camera coordinate system are estimated. Also note that, in this paper, we focus on rigid object pose estimation; articulated objects (e.g., the human body) are not considered [12].

[Footnote] Christos Papaioannidis and Ioannis Pitas are with the Department of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece (e-mail: cpapaionn@csd.auth.gr, pitas@aiia.csd.auth.gr). This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731667 (MULTIDRONE). This publication reflects the authors' views only. The European Commission is not responsible for any use that may be made of the information it contains.
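The relation between a unit quaternion q = (w, x, y, z) and the rotation R ∈ R^{3×3} mentioned above is the standard quaternion-to-rotation-matrix conversion, sketched below. The function name is ours; only the conversion formula itself is standard.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a unit quaternion q = (w, x, y, z) to a 3x3 rotation matrix R.

    Standard conversion linking the quaternion pose representation to the
    rotation between the object and camera coordinate systems.
    """
    w, x, y, z = q / np.linalg.norm(q)  # normalize to guard against non-unit input
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

# Example: a 90-degree rotation about the camera Z axis.
theta = np.pi / 2
q = np.array([np.cos(theta / 2), 0.0, 0.0, np.sin(theta / 2)])
R = quat_to_rotmat(q)
# R is orthonormal with det(R) = 1, and R @ [1, 0, 0] -> [0, 1, 0]
```

Note that q and -q encode the same rotation, which is why quaternion distances (and losses built on them) typically take an absolute value of the quaternion dot product.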
Both classification and regression are typical machine learning problems. Deep learning, and especially Convolutional Neural Networks (CNNs) [13], showed remarkable performance in such computer vision tasks, e.g., object detection [14]–[16], recognition [17], [18] and instance segmentation [19]. Deep CNNs were also successfully used for 6D object pose estimation [20]–[24], where both the 3D rotation and 3D translation of the object are estimated. CNNs usually require a huge amount of training data, and there is limited availability of object images annotated with their ground truth 3D pose, due to the inherent difficulty in estimating such a ground truth. However, 3D object models, if available, can be used to create large amounts of synthetic object images along with their ground truth poses for CNN training [8]–[11], [25]. In the proposed method, a lightweight CNN model is trained using both real and synthetic color object images.

Since most pose estimation methods rely on deep network architectures and/or RGB-D data, our goal is to offer a lightweight and reliable RGB-only 3D object pose estimation method, which can be utilized in embedded systems. Inspired by [10], the proposed method utilizes siamese and triplet CNNs to calculate 3D object pose features. By combining manifold learning and regression, the CNN learns to produce pose features from which both the object identity and 3D pose can be inferred. However, in contrast to [10], the proposed CNN model is forced to learn features whose distance in the feature space is proportional to the corresponding quaternion distance. To this end, a novel quaternion-based multi-objective loss function is proposed, which combines the strengths of both manifold learning and regression. The trained CNN model demonstrates state-of-the-art 3D object pose estimation accuracy along with object classification.
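The idea of combining a manifold-learning (triplet) term with a regression term can be sketched as below. This is a hypothetical illustration, not the paper's exact loss: the function names, the `1 - |q1·q2|` quaternion distance, the pose-dependent margin, and the weighting `w_reg` are all our assumptions, chosen only to show how feature-space distances can be tied to quaternion distances.

```python
import numpy as np

def quat_dist(q1, q2):
    """Distance between unit quaternions; |dot| handles the q == -q ambiguity."""
    return 1.0 - np.abs(np.dot(q1, q2))

def multi_objective_loss(f_a, f_p, f_n, q_pred, q_a, q_p, q_n,
                         margin_scale=1.0, w_reg=1.0):
    """Hypothetical sketch of a quaternion-based multi-objective loss.

    f_a, f_p, f_n : descriptors of anchor, pose-similar (positive) and
                    pose-dissimilar (negative) images (triplet term).
    q_pred        : quaternion regressed by the network for the anchor.
    q_a, q_p, q_n : ground-truth quaternions of the three images.
    The triplet margin grows with the quaternion-distance gap, pushing
    feature-space distances to track pose distances.
    """
    d_pos = np.sum((f_a - f_p) ** 2)                 # anchor-positive distance
    d_neg = np.sum((f_a - f_n) ** 2)                 # anchor-negative distance
    margin = margin_scale * (quat_dist(q_a, q_n) - quat_dist(q_a, q_p))
    triplet = max(0.0, d_pos - d_neg + margin)       # manifold-learning term
    regression = quat_dist(q_pred / np.linalg.norm(q_pred), q_a)  # direct pose term
    return triplet + w_reg * regression
```

For a well-separated triplet with a perfect pose prediction (e.g., `f_a == f_p`, a distant `f_n`, and `q_pred == q_a`), both terms vanish and the loss is zero.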
In addition, the object identity and 3D pose are estimated in real time, rendering the method suitable for embedded computing in autonomous robotic systems, such as drones. In drone cinematography [26]–[34], 3D target (object) pose estimation is essential for autonomous navigation and visual drone control.

Copyright © 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org. Please cite the publisher maintained version in your work. DOI: 10.1109/TCSVT.2019.2929600