Tracking an RGB-D Camera on Mobile Devices Using an Improved Frame-to-Frame Pose Estimation Method

Jaepung An, Jaehyun Lee, Jiman Jeong, Insung Ihm
Department of Computer Science and Engineering, Sogang University, Korea / TmaxOS, Korea
{ajp5050,ihm}@sogang.ac.kr, {jaehyun lee,jiman jeong}@tmax.co.kr

Abstract

The simple frame-to-frame tracking used for dense visual odometry is computationally efficient, but is regarded as rather numerically unstable, easily entailing a rapid accumulation of pose estimation errors. In this paper, we show that a cost-efficient extension of frame-to-frame tracking can significantly improve the accuracy of estimated camera poses. In particular, we propose a multi-level pose error correction scheme in which camera poses are re-estimated only when necessary, against a few adaptively selected reference frames. Unlike recent successful camera tracking methods, which mostly rely on extra computing time and/or memory space to perform global pose optimization and/or maintain accumulated models, the extended frame-to-frame tracking requires keeping only a few recent frames to improve accuracy. The resulting visual odometry scheme is thus lightweight in terms of both time and space complexity, offering a compact implementation on mobile devices, which still lack the computing power to run such complicated methods.

1. Introduction

The effective estimation of 6-DOF camera poses and reconstruction of a 3D world from a sequence of 2D images captured by a moving camera has a wide range of computer vision and graphics applications, including robotics, virtual and augmented reality, 3D games, and 3D scanning.
With the availability of consumer-grade RGB-D cameras such as the Microsoft Kinect sensor, direct dense methods, which estimate motion and shape parameters directly from raw image data, have recently attracted a great deal of research interest because of their real-time applicability to robust camera tracking and dense map reconstruction. The basic element of direct dense visual localization and mapping is the optimization model which, derived from pixel-wise constraints, yields an estimate of the rigid-body motion between two time frames. For effective pose estimation, several different forms of error models for formulating a cost function were proposed independently in 2011. Newcombe et al. [11] used only geometric information from input depth images to build an effective iterative closest point (ICP) model, while Steinbrücker et al. [14] and Audras et al. [1] minimized a cost function based on photometric error. Tykkälä et al. [17], in contrast, used both geometric and photometric information from the RGB-D image to build a bi-objective cost function. Since then, several variants of these optimization models have been developed to improve the accuracy of pose estimation. Except for the KinectFusion method [11], the initial direct dense methods followed the framework of frame-to-frame tracking, which estimates camera poses by repeatedly registering the current frame against the last frame. While computationally efficient, the frame-to-frame approach usually suffers from substantial drift due to the numerical instability caused mainly by the low precision of consumer-level RGB-D cameras. In particular, errors and noise in their depth measurements are among the main obstacles to a stable numerical solution of the pose estimation model.
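To make the photometric error model concrete, the sketch below (our own minimal illustration, not code from any of the cited systems; the function name, the nearest-neighbour intensity lookup, and the pinhole intrinsics matrix `K` are simplifying assumptions) back-projects each reference pixel using its depth, transforms the resulting 3D point by a candidate rigid-body motion `T`, and reprojects it into the current image to form the per-pixel intensity residuals that such cost functions minimize.

```python
import numpy as np

def photometric_residuals(I_ref, D_ref, I_cur, K, T):
    """Per-pixel photometric residuals r_i = I_cur(pi(T X_i)) - I_ref(x_i),
    where X_i is the back-projection of reference pixel x_i with depth D_ref.
    Uses nearest-neighbour lookup for brevity (real systems interpolate)."""
    h, w = I_ref.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = D_ref
    # back-project reference pixels to homogeneous 3D camera coordinates
    X = np.stack([(us - cx) * z / fx, (vs - cy) * z / fy,
                  z, np.ones_like(z)], axis=-1)
    Xc = X @ T.T                                  # rigid-body motion
    u2 = fx * Xc[..., 0] / Xc[..., 2] + cx        # reproject into current frame
    v2 = fy * Xc[..., 1] / Xc[..., 2] + cy
    ui = np.round(u2).astype(int)
    vi = np.round(v2).astype(int)
    valid = (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h) & (z > 0)
    return I_cur[vi[valid], ui[valid]] - I_ref[vs[valid], us[valid]]
```

A pose estimator would stack these residuals into a nonlinear least-squares objective and iterate over small twist updates; with the identity transform and identical images, every residual is zero by construction.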
In order to develop a more stable pose estimation method, the KinectFusion system [11] adopted a frame-to-model tracking approach that registers every new depth measurement against an incrementally accumulated dense scene geometry, represented as a volumetric truncated signed distance field. By using higher-quality depth images extracted on the fly from the fully up-to-date 3D geometry, it was shown that camera drift can decrease markedly while smooth dense 3D models are constructed in real time using a highly parallel PC GPU implementation. A variant of frame-to-model tracking was presented by Keller et al. [9], in which aligned depth images were incrementally fused into a surfel-based model instead of a 3D volume grid, offering a more memory-efficient implementation. While producing more accurate pose estimates than frame-to-frame tracking, the frame-to-model tracking techniques must manipulate
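The weighted-average fusion at the core of such volumetric frame-to-model tracking can be sketched as follows (a minimal NumPy illustration under assumed names — `integrate`, `origin`, `voxel_size`, and the truncation band `trunc` are ours, not identifiers from KinectFusion): each voxel center is projected into the camera, the signed distance to the measured surface along the viewing ray is truncated, and a running weighted average accumulates measurements across frames.

```python
import numpy as np

def integrate(tsdf, weight, depth, K, T_cw, origin, voxel_size, trunc):
    """Fuse one depth image into a truncated signed distance field
    via the standard per-voxel running weighted average."""
    nx, ny, nz = tsdf.shape
    h, w = depth.shape
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz),
                             indexing='ij')
    # voxel centers in world coordinates
    P = origin + (np.stack([ii, jj, kk], axis=-1) + 0.5) * voxel_size
    Pc = P @ T_cw[:3, :3].T + T_cw[:3, 3]          # world -> camera frame
    z = Pc[..., 2]
    zs = np.maximum(z, 1e-9)                       # guard the division
    u = np.round(K[0, 0] * Pc[..., 0] / zs + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * Pc[..., 1] / zs + K[1, 2]).astype(int)
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.where(ok, depth[v.clip(0, h - 1), u.clip(0, w - 1)], 0.0)
    sdf = d - z                                    # distance along the ray
    upd = ok & (d > 0) & (sdf > -trunc)            # skip voxels far behind
    new = np.clip(sdf / trunc, -1.0, 1.0)
    w_new = weight + upd                           # weight 1 per measurement
    tsdf[:] = np.where(upd, (tsdf * weight + new) / np.maximum(w_new, 1), tsdf)
    weight[:] = w_new
```

After integration, voxels in front of the observed surface hold positive values and voxels just behind it negative ones, so the surface can be recovered as the zero crossing; averaging over many frames is what yields the smoother, higher-quality depth used for registration.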