Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation

Leonid Sigal    Michael J. Black
Department of Computer Science, Brown University, Providence, RI 02912
{ls,black}@cs.brown.edu

Abstract

Part-based tree-structured models have been widely used for 2D articulated human pose estimation. These approaches admit efficient inference algorithms while capturing the important kinematic constraints of the human body as a graphical model. These methods often fail, however, when multiple body parts fit the same image region, resulting in global pose estimates that poorly explain the overall image evidence. Attempts to solve this problem have focused on the use of strong prior models that are limited to learned activities such as walking. We argue that the problem actually lies with the image observations and not with the prior. In particular, image evidence for each body part is estimated independently of other parts, without regard to self-occlusion. To address this we introduce occlusion-sensitive local likelihoods that approximate the global image likelihood using per-pixel hidden binary variables that encode the occlusion relationships between parts. This occlusion reasoning introduces interactions between non-adjacent body parts, creating loops in the underlying graphical model. We deal with this using an extension of an approximate belief propagation algorithm (PAMPAS). The algorithm recovers the real-valued 2D pose of the body in the presence of occlusions, does not require strong priors over body pose, and does a quantitatively better job of explaining image evidence than previous methods.

1. Introduction

Recent approaches to articulated human body detection and pose estimation exploit part-based tree-structured models [3, 5, 8, 13, 15, 17] that capture kinematic relations between body parts.
In such models a body part is represented as a node in a graph, and edges between nodes represent the kinematic constraints between connected parts. These models are attractive because they allow local estimates of limb pose to be combined into globally consistent body poses. While this distributed computation admits efficient inference methods, the local nature of the inference itself is also the Achilles' heel of these methods. The image evidence for each part is estimated independently of the other parts and, without a global measure of the image likelihood of a body pose, multiple body parts can, and often do, explain the same image data.

Figure 1. Silly Walks. The detection of 2D body pose in real images is challenging due to complex background appearance, loose monochromatic clothing, and the sometimes unexpected nature of human motion. In this scene, strong, activity-dependent prior models of human pose are too restrictive. The result here was found by our method, which makes weak assumptions about body pose but uses a new occlusion-sensitive image likelihood.

In particular, for 2D body pose estimation, the "wrong" solutions are often more likely than the "true" solution. Figure 2 illustrates the problem that results when local image likelihood measures for each body part do not take into account the poses of other parts and do not exploit any knowledge of what image evidence is left unexplained. This problem is not unique to human pose estimation and arises in other generic object-recognition problems.

Recent attempts to solve the problems illustrated in Figure 2 have focused on the use of strong prior models of body pose that rule out unlikely poses [8]. These approaches are not appropriate for dealing with unexpected or unusual motions such as those in Figure 1. In particular, they require that we already know the activity being observed and that the variation in the pose is within learned limits.
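The core idea of the occlusion-sensitive likelihood can be illustrated with a small sketch. This is not the paper's implementation: the 1-D "image", the rectangular part masks, the Gaussian pixel model, and the fixed front-to-back depth ordering are all simplifying assumptions made for illustration. The point is that the per-pixel binary visibility variables prevent a pixel already explained by a closer part from also contributing to the likelihood of a part behind it, so overlapping parts can no longer both "claim" the same evidence.

```python
import numpy as np

def part_mask(center, half_width, n_pixels):
    """Binary mask of the pixels a rectangular part projects onto."""
    mask = np.zeros(n_pixels, dtype=bool)
    lo, hi = max(0, center - half_width), min(n_pixels, center + half_width + 1)
    mask[lo:hi] = True
    return mask

def occlusion_sensitive_log_likelihood(parts, image, sigma=0.1):
    """Sum per-part pixel log-likelihoods, skipping occluded pixels.

    `parts` is a list of dicts ordered front (closest to the camera)
    to back; each has a projected `mask` and a predicted `appearance`.
    """
    claimed = np.zeros(image.size, dtype=bool)  # pixels explained by a closer part
    total = 0.0
    for p in parts:                             # visit parts front-to-back
        visible = p["mask"] & ~claimed          # per-pixel binary visibility variables
        diff = image[visible] - p["appearance"]
        total += -0.5 * np.sum((diff / sigma) ** 2)  # Gaussian pixel model (up to a constant)
        claimed |= p["mask"]
    return total

# Two non-overlapping parts that each match their pixels exactly:
image = np.zeros(20)
image[4:9] = 1.0
image[10:15] = 0.5
arm = {"mask": part_mask(6, 2, 20), "appearance": 1.0}
torso = {"mask": part_mask(12, 2, 20), "appearance": 0.5}
score = occlusion_sensitive_log_likelihood([arm, torso], image)
```

When a front part overlaps a rear one, the shared pixels are scored only against the front part, and the rear part is neither rewarded nor penalized for evidence it cannot see; a naive independent-part likelihood would instead count those pixels twice.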
Other computational strategies incrementally explore the space of body poses but give up the formal probabilistic interpretation of graphical models [13]. In this paper we argue that such approaches are fighting the wrong image likelihood