Geometric Pose Affordance: Monocular 3D Human Pose Estimation with Scene Constraints

Zhe Wang, Liyan Chen, Shaurya Rathore, Daeyun Shin, Charless Fowlkes
Department of Computer Science, University of California, Irvine, CA 92617, USA

ABSTRACT

Accurate estimation of 3D human pose from a single image remains a challenging task despite many recent advances. In this paper, we explore the hypothesis that strong prior information about scene geometry can be used to improve pose estimation accuracy. To tackle this question empirically, we have assembled a novel Geometric Pose Affordance dataset, consisting of multi-view imagery of people interacting with a variety of rich 3D environments. We utilized a commercial motion capture system to collect gold-standard estimates of pose and construct accurate 3D models of the scene geometry. To inject prior knowledge of scene constraints into existing frameworks for pose estimation from images, we introduce a view-based representation of scene geometry, a multi-layer depth map, which employs multi-hit ray tracing to concisely encode multiple surface entry and exit points along each camera view ray. We propose two different mechanisms for integrating multi-layer depth information into pose estimation: first, as encoded ray features used in lifting 2D pose to full 3D, and second, as a differentiable loss that encourages learned models to favor geometrically consistent pose estimates. We show experimentally that these techniques can improve the accuracy of 3D pose estimates, particularly in the presence of occlusion and complex scene geometry.

1. Introduction

Accurate estimation of human pose in 3D from image data would enable a wide range of interesting applications in emerging fields such as virtual and augmented reality, humanoid robotics, workplace safety, and monitoring mobility and fall prevention in aging populations.
Interestingly, many such applications are set in relatively controlled environments (e.g., the home) where large parts of the scene geometry are relatively static (e.g., walls, doors, heavy furniture). We are interested in the following question: "Can strong knowledge of scene geometry improve our estimates of human pose from images?"

Consider the images in Fig. 1a. Intuitively, if we know the 3D locations of surfaces in the scene, this should constrain our estimates of pose. Hands and feet should not interpenetrate scene surfaces, and if we see someone sitting on a surface of known height, we should have a good estimate of where their hips are even if large parts of the body are occluded. This general notion of scene affordance [1] has been explored as a tool for understanding functional and geometric properties of a scene (Gupta et al., 2011; Fouhey et al., 2012; Wang et al., 2017; Li et al., 2019). However, the focus of such work has largely been on using estimated human pose to infer scene geometry and function. Surprisingly, there has been little demonstration of how scene

[1] "The meaning or value of a thing consists of what it affords." -J.J. Gibson (1979)

arXiv:1905.07718v2 [cs.CV] 9 Dec 2021
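To make the abstract's two ingredients concrete, the following toy sketch (not the authors' implementation; the box-based scene and all function names are illustrative assumptions) shows a multi-layer depth computation — for one camera ray, every surface entry and exit depth is recorded via multi-hit intersection rather than only the first hit — and a simple penetration penalty that scores a joint's depth along that ray for geometric consistency with the recorded layers.

```python
# Toy sketch of a "multi-layer depth map" for a single view ray, with scene
# surfaces modeled as axis-aligned boxes for simplicity. All names here are
# illustrative assumptions, not the paper's actual code.

def ray_box_hits(origin, direction, box_min, box_max, eps=1e-9):
    """Return (t_entry, t_exit) where the ray enters/exits the box, or None."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < eps:
            if o < lo or o > hi:      # ray parallel to this slab and outside it
                return None
        else:
            t0, t1 = (lo - o) / d, (hi - o) / d
            if t0 > t1:
                t0, t1 = t1, t0
            t_near, t_far = max(t_near, t0), min(t_far, t1)
            if t_near > t_far:        # slab intervals no longer overlap
                return None
    return (t_near, t_far)

def multi_layer_depth(origin, direction, boxes):
    """All surface entry/exit depths along one view ray, sorted near-to-far."""
    hits = []
    for box_min, box_max in boxes:
        h = ray_box_hits(origin, direction, box_min, box_max)
        if h is not None:
            hits.extend(h)            # keep both entry and exit depths
    return sorted(hits)

def penetration_depth(d, layers):
    """How far a joint at ray depth d sits inside an occupied interval.

    Pairs consecutive layers as (entry, exit); assumes non-overlapping
    surfaces. In a learned model this quantity (a piecewise-linear function
    of d) could serve as a differentiable geometric-consistency penalty.
    """
    for t_in, t_out in zip(layers[::2], layers[1::2]):
        if t_in < d < t_out:
            return min(d - t_in, t_out - d)
    return 0.0

# A ray looking down +z through a table top (z in [2, 2.1]) and a wall (z in [5, 5.2]):
boxes = [((-1, -1, 2.0), (1, 1, 2.1)), ((-2, -2, 5.0), (2, 2, 5.2))]
layers = multi_layer_depth((0, 0, 0), (0, 0, 1), boxes)
print(layers)                         # → [2.0, 2.1, 5.0, 5.2]
print(penetration_depth(3.0, layers)) # → 0.0 (joint in free space between surfaces)
```

The key property is that a joint at depth 3.0 lies *between* the table and the wall, which a conventional (first-hit) depth map cannot express: with only the nearest depth 2.0 recorded, everything behind the table would look occluded or invalid.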