Nonlinear Mean Shift for Robust Pose Estimation

Raghav Subbarao†‡, Yakup Genc‡, Peter Meer†

†ECE Department, Rutgers University, Piscataway, NJ 08854
‡Real-time Vision and Modeling Department, Siemens Corporate Research, Princeton, NJ 08540

Abstract

We propose a new robust estimator for camera pose estimation based on a recently developed nonlinear mean shift algorithm. This allows us to treat pose estimation as a clustering problem in the presence of outliers. We compare our method to RANSAC, which is the standard robust estimator for computer vision problems. We also show that under fairly general assumptions our method is provably better than RANSAC. Synthetic and real examples to support our claims are provided.

1. Introduction

Real time estimation of camera pose is an important problem in computer vision. Pose estimation along with scene structure estimation is known as the Structure-From-Motion (SFM) problem which is the central goal of vision. It is widely accepted that once good estimates of the structure and motion are known, they can be improved using offline methods like bundle adjustment [19]. However, to get a starting point, a system needs to account for both noise and gross errors which do not satisfy the geometric constraints being enforced. Such errors are known as outliers.

Pose estimation is also a part of other applications such as augmented reality (AR). For AR only the pose of the camera is needed, although some structure may also be estimated. The pose is required in real time and offline methods such as bundle adjustment are not applicable here.

Random Sample Consensus (RANSAC) and its variations, which follow a hypothesise-and-test procedure, are the standard ways of handling outliers in SFM. In this paper we propose a new robust estimator for camera pose estimation.
The estimator is based on the nonlinear mean shift algorithm of [15, 20] applied to the Special Euclidean Group which is the set of all rigid body motions in 3D and is equivalent to the set of all camera poses. We show theoretically and experimentally that our method requires fewer hypotheses than any hypothesise-and-test algorithm for the same level of performance.

We discuss some of the previous work related to our approach in Section 2. In Section 3 we introduce the nonlinear mean shift algorithm. In Section 4 we develop a robust pose estimator based on this algorithm and outline a proof of why we expect the mean shift based estimator to be better than RANSAC. Finally, in Section 5 we present the results of experiments on synthetic and real data sets.

2. Previous Work

Classical methods reconstruct the scene using correspondences across images and estimating the epipolar geometry between pairs of frames or the trifocal tensor for three frames. These reconstructions are then stitched together into a single frame [14]. The Euclidean equivalent of this is the relative pose estimation problem given image correspondences between two images [8]. Alternatively, the motion and structure can be estimated in a single coordinate frame [12]. Such methods require absolute camera pose estimation based on correspondences between 3D world points and 2D image points [1, 6].

An important aspect of these algorithms is that whenever any geometrical constraint is being enforced, there will be outliers which do not satisfy the constraint. These outliers occur due to errors in lower level modules such as the image feature tracker. When estimating the motion and structure it is necessary to detect and remove these outliers.

The standard way of handling outliers in computer vision is the RANSAC algorithm [4]. In RANSAC, parameter hypotheses are generated by randomly choosing a minimal number of elements required to generate a hypothesis.
The hypotheses are scored by how likely they are to have generated the observed data, and the best hypothesis is retained. Depending on the noise model assumed, different scoring functions have been proposed, leading to variations of RANSAC [17, 18].

Another important contribution has been the development of preemptive forms of RANSAC [2, 10], which allow RANSAC to be used in real-time SFM systems. In such methods, not all of the hypotheses are scored completely; some are preemptively dropped. Unlike standard RANSAC, which generates and scores one hypothesis at a time and retains only the most likely one, preemptive RANSAC [10] proceeds by generating all the hypotheses at the beginning. The likelihood of the hypotheses

IEEE Workshop on Applications of Computer Vision (WACV'07) 0-7695-2794-9/07 $20.00 © 2007