3D Hand and Object Pose Estimation for Real-time Human-robot Interaction

Chaitanya Bandi (https://orcid.org/0000-0001-7339-8425), Hannes Kisner and Ulrike Thomas
Robotics and Human-Machine Interaction Lab., Technical University of Chemnitz, Reichenhainer Str. 70, Chemnitz, Germany

Keywords: Pose, Keypoints, Hand, Object.

Abstract: Estimating 3D hand pose and object pose in real time is essential for human-robot interaction scenarios such as the handover of objects. Handover scenarios in particular pose many challenges, such as mutual hand-object occlusions and the inference speed required to keep robots reactive. In this paper, we present an approach to estimate 3D hand pose and object pose in real time using a low-cost consumer RGB-D camera for human-robot interaction scenarios. We propose a cascade-of-networks strategy to regress 2D and 3D pose features. The first network detects the objects and hands in images. The second network is an end-to-end model with independent weights that regresses 2D keypoints of hand joints and object corners, followed by wrist-centric 3D hand and object pose regression using a novel residual graph regression network, and finally a perspective-n-point approach to solve the 6D pose of detected objects in hand. To train and evaluate our model, we also propose a small-scale 3D hand pose dataset with a new semi-automated annotation approach using a robot arm, and we demonstrate the generalizability of our model on state-of-the-art benchmarks.

1 INTRODUCTION

Hand and object pose estimation is an active research field for applications such as robotics, augmented reality, and manipulation. 3D hand pose estimation and 6D object pose estimation have largely been addressed independently. However, estimating hand and object pose jointly helps to resolve mutual occlusions, a problem that remains unsolved for real-time applications. In this work, we introduce a pipeline to estimate both hand pose and object pose for interaction scenarios.
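The final perspective-n-point (PnP) stage of the pipeline recovers an object's 6D pose (rotation and translation) from correspondences between detected 2D corners and the known 3D object corners; any PnP solver inverts the pinhole projection model. The sketch below shows only that forward model; the intrinsics, rotation, and cube corner are illustrative assumptions, not values from the paper.

```python
# Forward pinhole projection model that a PnP solver inverts:
# p = K * (R * X + t), followed by perspective division.
import math

def project(point_3d, rotation, translation, fx, fy, cx, cy):
    """Project a 3D object-frame point into pixel coordinates."""
    # Rigid transform R * X + t (row-major 3x3 rotation matrix).
    x, y, z = (
        sum(rotation[r][c] * point_3d[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    # Perspective division and intrinsic scaling.
    return (fx * x / z + cx, fy * y / z + cy)

# Illustrative example: a cube corner at 0.1 m on the object x-axis,
# object rotated 90 degrees about the camera z-axis and placed 0.5 m
# in front of the camera; assumed (not calibrated) intrinsics.
theta = math.pi / 2
R = [[math.cos(theta), -math.sin(theta), 0],
     [math.sin(theta),  math.cos(theta), 0],
     [0, 0, 1]]
t = [0.0, 0.0, 0.5]
u, v = project([0.1, 0.0, 0.0], R, t, fx=600, fy=600, cx=320, cy=240)
```

Given at least four such 2D-3D corner correspondences, a standard PnP solver (e.g. OpenCV's `cv2.solvePnP`) estimates R and t by minimizing the reprojection error of this model.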
In robotics applications such as the bidirectional handover of objects, reactiveness, reliability, and safety are highly significant. They can be achieved by precise real-time estimation of fingertip positions and object pose. State-of-the-art works rely heavily on deep learning architectures for 3D hand pose estimation (Zimmermann and Brox, 2017; Mueller et al., 2018; Iqbal et al., 2018; Ge et al., 2019), 6D object pose estimation (Tekin et al., 2017; Peng et al., 2018; Li et al., 2018; Wang et al., 2019; Park et al., 2019; Labbé et al., 2020; Tremblay et al., 2018), and unified hand and object pose estimation (Doosti et al., 2020; Hasson et al., 2019; Hasson et al., 2020; Tekin et al., 2019).

The works (Yang et al., 2020; Rosenberger et al., 2020) propose dedicated solutions for applications such as the handover of objects. These works rely on segmentation networks to obtain the regions of hands and objects, which are then forwarded to the respective grasp pose refinement model. Although segmentation networks are reliable, their inference is slow without dedicated hardware resources.

In this paper, we present an approach that regresses the 3D hand pose with a deep learning architecture and computes the 6D object pose, as illustrated in Figure 1. The first network is an independent object detection model that recognizes the regions of hands and objects. The second network consists of two deep learning models, which can be trained either end-to-end or independently, to infer the 2D hand pose, 3D hand pose, 2D object corners, and 3D object corners, followed by a perspective-n-point solver for the 6D object pose. In this work, we introduce a two-stream hourglass network for 2D pose estimation and a novel network for 3D hand pose regression based on graph convolutional networks. Training deep learning architectures requires large datasets, and quite a few benchmarks exist for hand-object pose estimation.
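A graph convolutional network suits 3D hand pose regression because the 21 hand keypoints form a fixed kinematic graph. The paper's exact layer definitions are not given here; the following is a minimal sketch, under our own assumptions, of a single residual graph-propagation step over the standard 21-joint hand skeleton (wrist plus four joints per finger).

```python
# Minimal sketch of one residual graph-propagation step over a 21-joint
# hand skeleton. Illustrates only neighbor averaging plus a residual
# (skip) connection; no learned weights, unlike the paper's network.

# Kinematic edges of a standard 21-keypoint hand (wrist = joint 0).
HAND_EDGES = [
    (0, 1), (1, 2), (2, 3), (3, 4),         # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),         # index
    (0, 9), (9, 10), (10, 11), (11, 12),    # middle
    (0, 13), (13, 14), (14, 15), (15, 16),  # ring
    (0, 17), (17, 18), (18, 19), (19, 20),  # pinky
]

def neighbors(num_joints, edges):
    """Adjacency sets including self-loops."""
    adj = {j: {j} for j in range(num_joints)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def residual_graph_step(features, adj):
    """Average each joint's neighborhood features (including itself)
    and add the input back as a residual connection."""
    out = []
    for j, feat in enumerate(features):
        nbrs = adj[j]
        mean = [sum(features[n][d] for n in nbrs) / len(nbrs)
                for d in range(len(feat))]
        out.append([f + m for f, m in zip(feat, mean)])
    return out

# Example: 21 joints with wrist-centric 3D coordinates, wrist at origin.
coords = [[0.01 * j, 0.02 * j, 0.0] for j in range(21)]
adj = neighbors(21, HAND_EDGES)
refined = residual_graph_step(coords, adj)
```

In a trained network, the averaging would be replaced by learned per-layer weight matrices, and several such residual blocks would be stacked to map 2D keypoint features to wrist-centric 3D joint positions.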
Most of these benchmarks rely on a manual annotation process, which is tedious, time-consuming, and costly. We therefore also introduce a new semi-automatic labeling process for 3D hand pose estimation.

Bandi, C., Kisner, H. and Thomas, U.
3D Hand and Object Pose Estimation for Real-time Human-robot Interaction.
DOI: 10.5220/0010902400003124
In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 4: VISAPP, pages 770-780
ISBN: 978-989-758-555-5; ISSN: 2184-4321
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved