Structure-aware 3D Hand Pose Regression from a Single Depth Image

Jameel Malik 1,2, Ahmed Elhayek 1, and Didier Stricker 1

1 Department Augmented Vision, DFKI Kaiserslautern, Germany
2 NUST-SEECS, Pakistan
{jameel.malik,ahmed.elhayek,didier.stricker}@dfki.de

Abstract. Hand pose tracking in 3D is an essential task for many virtual reality (VR) applications such as games and manipulating virtual objects with bare hands. CNN-based learning methods achieve state-of-the-art accuracy by directly regressing the 3D pose from a single depth image. However, the 3D pose estimated by these methods is coarse and kinematically unstable due to independent learning of sparse joint positions. In this paper, we propose a novel structure-aware CNN-based algorithm which learns to automatically segment the hand from a raw depth image and to estimate the 3D hand pose jointly with new structural constraints. The constraints include finger lengths, distances of joints along the kinematic chain, and finger inter-distances. Learning these constraints helps to maintain a structural relation between the estimated joint keypoints. In addition, we convert the sparse representation of the hand skeleton to a dense one by performing n-point interpolation between pairs of parent and child joints. Through comprehensive evaluation, we show the effectiveness of our approach and demonstrate competitive performance with state-of-the-art methods on the public NYU hand pose dataset.

Keywords: Hand pose · Depth image · Convolutional Neural Network (CNN).

1 Introduction

Markerless 3D hand pose estimation is a fundamental challenge for many interesting applications of virtual reality (VR) and augmented reality (AR), such as handling objects in VR environments, games, and interactive control. This task has been extensively studied in the past few years and great progress has been achieved, primarily due to the arrival of low-cost depth sensors and rapid advancements in deep learning.
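The skeleton densification mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 14-joint count, the kinematic tree in PARENTS, and the choice of n = 3 are all assumptions made for the example.

```python
import numpy as np

# Hypothetical parent index per joint for a 14-joint skeleton
# (-1 marks the root); the paper's exact kinematic tree may differ.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 0, 10, 11, 0]

def densify_skeleton(joints, n=3):
    """Insert n evenly spaced points between each parent-child joint
    pair, converting sparse keypoints into a dense skeleton."""
    dense = [joints]
    for child, parent in enumerate(PARENTS):
        if parent < 0:
            continue  # root joint has no incoming bone
        p, c = joints[parent], joints[child]
        # n interior points at fractions 1/(n+1), ..., n/(n+1) of the bone
        ts = np.linspace(0.0, 1.0, n + 2)[1:-1]
        dense.append(p[None, :] + ts[:, None] * (c - p)[None, :])
    return np.concatenate(dense, axis=0)

joints = np.random.rand(14, 3)          # toy 3D joint positions
dense = densify_skeleton(joints, n=3)
# 14 original joints + 3 interpolated points on each of the 13 bones
print(dense.shape)  # (53, 3)
```

The dense representation gives the network supervision along the whole bone rather than only at its endpoints, which is the motivation the paper states for the conversion.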
However, estimating the 3D hand pose from a single depth image remains challenging due to self-similarities, occlusions, a wide range of articulations, and varying hand shapes.

Hand pose estimation methods are classified into three main categories: learning-based (discriminative) methods, model-based (generative) methods, and hybrid methods that combine the two. Among these, CNN-based discriminative methods have shown the highest accuracy on public benchmarks. Despite this accuracy, they do not exploit the structural information of the hand well during the learning process [35, 34, 11]. Specifically, independent learning of sparse joint positions, with no consideration of the joint connection structure and hand skeleton constraints, leads to coarse predictions.
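The kind of skeleton constraints the paper argues for can be made concrete with a small sketch. This is an illustrative stand-in, not the paper's loss: the bone list, fingertip indices, and plain L2 penalties are assumptions chosen for the example.

```python
import numpy as np

def bone_lengths(joints, bones):
    """Distances between parent-child joint pairs along the kinematic chain."""
    j = np.asarray(joints)
    return np.array([np.linalg.norm(j[c] - j[p]) for p, c in bones])

def tip_dists(joints, tip_idx):
    """Pairwise distances between fingertip joints (finger inter-distances)."""
    t = np.asarray(joints)[tip_idx]
    return np.linalg.norm(t[:, None, :] - t[None, :, :], axis=-1)

def structure_loss(pred, gt, bones, tip_idx):
    """L2 penalty on deviations of bone lengths and fingertip
    inter-distances from the ground truth (illustrative only)."""
    bl = np.sum((bone_lengths(pred, bones) - bone_lengths(gt, bones)) ** 2)
    inter = np.sum((tip_dists(pred, tip_idx) - tip_dists(gt, tip_idx)) ** 2)
    return bl + inter

# Toy 3-joint chain (hypothetical indices, not the paper's skeleton)
bones = [(0, 1), (1, 2)]
tips = [2]
gt = np.array([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.]])
pred = gt + 0.1  # a uniform shift preserves all structural quantities
print(structure_loss(pred, gt, bones, tips))  # 0.0
```

Note that a uniform translation of all joints incurs zero penalty: the constraints care only about relative structure, which is exactly what independent per-joint regression ignores.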