Probabilistic object models for pose estimation in 2D images

Damien Teney 1 and Justus Piater 2

1 University of Liège, Belgium
  Damien.Teney@ULg.ac.be
2 University of Innsbruck, Austria
  Justus.Piater@UIBK.ac.at

Abstract. We present a novel way of performing pose estimation of known objects in 2D images. We follow a probabilistic approach for modeling objects and representing the observations. These object models are suited to various types of observable visual features, and are demonstrated here with edge segments. Even imperfect models, learned from single stereo views of objects, can be used to infer the maximum-likelihood pose of the object in a novel scene, using a Metropolis-Hastings MCMC algorithm, given a single, calibrated 2D view of the scene. The probabilistic approach does not require explicit model-to-scene correspondences, allowing the system to handle objects without individually-identifiable features. We demonstrate the suitability of these object models to pose estimation in 2D images through qualitative and quantitative evaluations, showing that the pose of textureless objects can be recovered in scenes with clutter and occlusion.

1 Introduction

Estimating the 3D pose of a known object in a scene has many applications in different domains, such as robotic interaction and grasping [1,6,13], augmented reality [7,9,19] and the tracking of objects [11]. The observations of such a scene can sometimes be provided as a 3D reconstruction of the scene [4], e.g. through stereo vision [5]. However, in many scenarios, stereo reconstructions are unavailable or unreliable, due to resource limitations or to imaging conditions such as a lack of scene texture. This paper addresses the use of a single, monocular image as the source of scene observations. Some methods proposed in this context make use of the appearance of the object as a whole [6,13,15].
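To make the inference step above concrete, the following is a minimal, self-contained sketch of Metropolis-Hastings sampling used to locate a maximum-likelihood pose. It is not the paper's implementation: the pose parametrization (`x`, `y`, `theta`), the Gaussian proposal, and the toy likelihood `toy_log_likelihood` are all illustrative assumptions standing in for the paper's edge-segment observation model.

```python
import math
import random

def metropolis_hastings(log_likelihood, initial_pose, n_iters=5000, step=0.1, seed=0):
    """Generic Metropolis-Hastings sampler over a pose vector.

    Uses a symmetric Gaussian proposal, so the acceptance ratio reduces to
    the likelihood ratio. Returns the best (maximum-likelihood) sample seen.
    """
    rng = random.Random(seed)
    pose = list(initial_pose)
    ll = log_likelihood(pose)
    best_pose, best_ll = list(pose), ll
    for _ in range(n_iters):
        # Propose a random perturbation of the current pose.
        proposal = [p + rng.gauss(0.0, step) for p in pose]
        ll_new = log_likelihood(proposal)
        # Accept with probability min(1, L(new)/L(old)), in log space.
        if math.log(rng.random() + 1e-300) < ll_new - ll:
            pose, ll = proposal, ll_new
            if ll > best_ll:
                best_pose, best_ll = list(pose), ll
    return best_pose, best_ll

# Toy stand-in for a scene likelihood: sharply peaked at a "true" pose.
true_pose = (1.0, -0.5, 0.3)  # hypothetical (x, y, theta)

def toy_log_likelihood(pose):
    return -sum((p - t) ** 2 for p, t in zip(pose, true_pose)) / (2 * 0.1 ** 2)

estimate, _ = metropolis_hastings(toy_log_likelihood, [0.0, 0.0, 0.0])
```

With a peaked likelihood such as this, the sampler behaves almost like stochastic hill climbing and the best sample found converges near the true pose; in the paper's setting the likelihood would instead score how well the projected object model explains the observed edge segments.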
These so-called appearance-based methods, however, suffer from the need for a large number of training views. The state-of-the-art methods in the domain rather rely on matching characteristic, local features between the observations of the scene and a stored, 3D model of the object [1,7,17]. This approach, although efficient with textured objects or otherwise matchable features, fails when considering non-textured objects, or visual features that cannot be as precisely located as the texture patches or geometric features used in the classical methods. Hsiao et al.'s method [8] seeks