MobileHumanPose: Toward real-time 3D human pose estimation in mobile devices

Sangbum Choi, Seokeon Choi, Changick Kim
Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
{sangbumchoi, seokeon, changick}@kaist.ac.kr

Abstract

Current 3D pose estimation methods are not compatible with a variety of low computational power devices because of limitations in efficiency and accuracy. In this paper, we revisit the pose estimation architecture from the viewpoint of both efficiency and accuracy. We propose a mobile-friendly model, MobileHumanPose, for real-time 3D human pose estimation from a single RGB image. The model consists of a modified MobileNetV2 backbone, a parametric activation function, and a skip concatenation inspired by U-Net. In particular, the skip concatenation structure improves accuracy by propagating richer features at negligible computational cost. Our model not only achieves performance comparable to state-of-the-art models but also has a model size seven times smaller than that of the ResNet-50-based model. In addition, our extra small model reduces inference time by 12.2 ms on a Galaxy S20 CPU, making it suitable for real-time 3D human pose estimation in mobile applications. The source code is available at: https://github.com/SangbumChoi/MobileHumanPose.

1. Introduction

Due to the rapid development of deep convolutional neural networks and heatmap representations, 3D human pose estimation has seen significant performance improvements. These improvements help to solve many problems in widespread applications such as human-computer interaction, robotics, surveillance, AR (augmented reality), and VR (virtual reality). In particular, Mobile Augmented Reality (MAR) has recently attracted much interest in both academia and industry. Therefore, constructing a 3D human pose estimation model under restricted computational power is an important task.
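As a rough illustration of the skip concatenation mentioned in the abstract, the following minimal sketch joins an encoder feature map and a decoder feature map of the same spatial size along the channel axis. It is not taken from the released source code: the names `FeatureMap`, `shape`, `constant_map`, and `skip_concat` are hypothetical, and feature maps are represented as plain nested lists (channels x height x width) only to keep the example free of dependencies.

```python
from typing import List, Tuple

# Hypothetical representation: a feature map is channels x height x width.
FeatureMap = List[List[List[float]]]


def shape(fm: FeatureMap) -> Tuple[int, int, int]:
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(fm), len(fm[0]), len(fm[0][0]))


def constant_map(channels: int, height: int, width: int, value: float) -> FeatureMap:
    """Build a feature map filled with a single value (illustration only)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


def skip_concat(encoder_fm: FeatureMap, decoder_fm: FeatureMap) -> FeatureMap:
    """Concatenate decoder and encoder features along the channel axis.

    Both inputs must share the same height and width. The operation only
    stacks channels, so it introduces no learnable parameters of its own.
    """
    if shape(encoder_fm)[1:] != shape(decoder_fm)[1:]:
        raise ValueError("encoder and decoder maps must have the same spatial size")
    return decoder_fm + encoder_fm  # list concatenation stacks the channels


# Assumed toy sizes: 2 encoder channels and 3 decoder channels on a 4x4 grid.
encoder = constant_map(2, 4, 4, 1.0)
decoder = constant_map(3, 4, 4, 2.0)
fused = skip_concat(encoder, decoder)
print(shape(fused))
```

In a real network the fused tensor would be passed to the next decoder layer, whose input channel count has to match the sum of the two channel counts.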
However, the performance gain of a deep learning-based model comes with wider channels and deeper convolutional layers [44]. This increases the computing cost, which makes such models unsuitable for resource-limited devices such as smartphones.

Figure 1. The difference between residual and skip concatenation structures. The residual concatenation is implemented between adjacent blocks with down/up-sampled features. In contrast, skip concatenation is a pure concatenation between the encoder and decoder with the same dimension.

Unfortunately, among the many 3D human pose estimation papers, only two [10, 18] have dealt with the issue of model efficiency. However, both methods have significant drawbacks for the following reasons: (a) Although differentiable architecture search (DARTS) [10] might effectively search the network architecture of 3D human pose estimation, the number of parameters and computational costs are