A Probabilistic Framework for 3D Visual Object Representation

Renaud Detry, Nicolas Pugeault, Justus Piater

Abstract—We present an object representation framework that encodes probabilistic spatial relations between 3D features and organizes these features in a hierarchy. Features at the bottom of the hierarchy are bound to local 3D descriptors. Higher-level features recursively encode probabilistic spatial configurations of more elementary features. The hierarchy is implemented in a Markov network. Detection is carried out by a belief propagation algorithm, which infers the pose of high-level features from local evidence and reinforces local evidence from globally consistent knowledge, effectively producing a likelihood for the pose of the object in the detection scene. We also present a simple learning algorithm that autonomously builds hierarchies from local object descriptors. We explain how to use our framework to estimate the pose of a known object in an unknown scene. Experiments demonstrate the robustness of hierarchies to input noise, viewpoint changes and occlusions.

Index Terms—Computer vision, 3D object representation, pose estimation, nonparametric belief propagation.

• R. Detry (Renaud.Detry@ULg.ac.be) and J. Piater are with the INTELSIG Laboratory, Univ. of Liège, Belgium.
• N. Pugeault is with the Cognitive Vision Lab, The Maersk Mc-Kinney Moller Inst., Univ. of Southern Denmark, Denmark.

Manuscript received 16 Aug. 2008; revised 18 Dec. 2008; accepted 5 Mar. 2009; published online 17 Mar. 2009. Recommended for acceptance by Q. Ji, A. Torralba, T. Huang, E. Sudderth, and J. Luo. This is an author postprint. Digital Object Identifier no. 10.1109/TPAMI.2009.64.

1 INTRODUCTION

The merits of part-based and hierarchical approaches to object modeling have often been put forward in the vision community [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. Part-based models typically separate structure from appearance, which allows them to deal with variability separately in each modality. A hierarchy of parts takes this idea further by introducing scale-dependent variability: small part configurations can be tightly constrained, while wider associations can allow for more variability. Furthermore, part-based models not only allow for the detection and localization of an object, but also for the parsing of its constituent parts. They lend themselves to part sharing and reuse, which should help in overcoming the problem of storage size and detection cost in large object databases. Finally, these models not only allow for bottom-up inference of object parameters based on features detected in images, but also for top-down inference of image-space appearance based on object parameters.

A large body of the object modeling literature focuses on modeling the 2D projections of a 3D object. A major issue with this approach is that all variations introduced by projective geometry (geometrical transformations, self-occlusions) have to be robustly captured and handled by the model. In the past few years, modeling objects directly in 3D has become increasingly popular [11], [12], [13], [14], [15], [16]. The main advantage of these methods lies in their natural ability to handle projective transformations and self-occlusions.
The main contribution of this paper is a framework that encodes the 3D geometry and visual appearance of an object into a part-based model, and mechanisms for autonomous learning and probabilistic inference of the model. Our representation combines local appearance and 3D spatial relationships through a hierarchy of increasingly expressive features. Features at the bottom of the hierarchy are bound to local 3D visual perceptions called observations. Features at other levels represent combinations of more elementary features, encoding probabilistic relative spatial relationships between their children. The top level of the hierarchy contains a single feature which represents the whole object. The hierarchy is implemented in a Markov random field, where features correspond to hidden variables, and spatial relationships define pairwise potentials. To detect instances of a model in a scene, observational evidence is propagated throughout the hierarchy by probabilistic inference mechanisms, leading to one or more consistent scene interpretations. Thus, the model is able to suggest a number of likely poses for the object, a pose being composed of a 3D world location and a 3D world orientation. The inference process follows a nonparametric belief propagation scheme [17] which uses importance sampling for message products. A sketch of this propagation step is given below.

The model is bound to no particular learning scheme. In this paper, we present an autonomous learning method that builds hierarchies in a bottom-up fashion. Learning and detection algorithms reason directly on sets of local 3D visual perceptions which we will refer to as (input) observations. These observations should represent visual input in terms of 3D descriptors, i.e.
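The following minimal Python sketch (not the authors' implementation) illustrates the structure just described: features as hidden variables whose beliefs are sets of weighted pose samples, pairwise potentials encoding the relative spatial relation between a child and its parent, and a nonparametric belief-propagation step in which the product of incoming messages is approximated by importance sampling. To keep it self-contained, a "pose" is reduced to a 3D position (orientation is omitted), relations are modeled as Gaussian offsets, and the names Feature, Edge, message and fuse_messages are hypothetical.

# Minimal sketch, under the simplifying assumptions stated above.
import numpy as np

class Feature:
    """Hidden variable: nonparametric belief over the pose of one feature."""
    def __init__(self, name, samples=None):
        self.name = name
        self.samples = samples if samples is not None else np.empty((0, 3))
        self.weights = np.ones(len(self.samples)) / max(len(self.samples), 1)

class Edge:
    """Pairwise potential: Gaussian model of the child-to-parent offset."""
    def __init__(self, child, parent, offset, sigma=0.01):
        self.child, self.parent = child, parent
        self.offset = np.asarray(offset, dtype=float)
        self.sigma = sigma

    def potential(self, x_child, x_parent):
        # Compatibility of a child/parent pose pair (shown for completeness;
        # the sampled messages below play this role during inference).
        d = x_parent - x_child - self.offset
        return np.exp(-0.5 * np.sum(d * d, axis=-1) / self.sigma ** 2)

def message(edge, n=200, rng=np.random.default_rng(0)):
    """Child-to-parent message: push child pose samples through the relation."""
    idx = rng.choice(len(edge.child.samples), size=n, p=edge.child.weights)
    noise = rng.normal(scale=edge.sigma, size=(n, 3))
    return edge.child.samples[idx] + edge.offset + noise

def fuse_messages(messages, n=200, bandwidth=0.01, rng=np.random.default_rng(1)):
    """Importance-sampling approximation of the message product: draw proposals
    from one message and weight them under kernel estimates of the others."""
    proposals, weights = messages[0], np.ones(len(messages[0]))
    for m in messages[1:]:
        d = proposals[:, None, :] - m[None, :, :]
        k = np.exp(-0.5 * np.sum(d * d, axis=-1) / bandwidth ** 2).mean(axis=1)
        weights *= k + 1e-12
    weights /= weights.sum()
    idx = rng.choice(len(proposals), size=n, p=weights)
    return proposals[idx], np.full(n, 1.0 / n)

# Usage: two observed low-level features vote for the pose of their common parent.
rng = np.random.default_rng(2)
child_a = Feature("descriptor-A", samples=rng.normal([0.00, 0, 0], 0.005, (300, 3)))
child_b = Feature("descriptor-B", samples=rng.normal([0.10, 0, 0], 0.005, (300, 3)))
parent = Feature("object")
edges = [Edge(child_a, parent, offset=[+0.05, 0, 0]),
         Edge(child_b, parent, offset=[-0.05, 0, 0])]
parent.samples, parent.weights = fuse_messages([message(e) for e in edges])
print("estimated object position:", parent.samples.mean(axis=0))  # approx. [0.05, 0, 0]

In the actual framework, the belief of the top-level feature plays the role of the object pose likelihood reported by detection; the sketch only conveys how evidence flows from child features to a parent through their pairwise spatial relations.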