Learning to Fuse: A Deep Learning Approach to Visual-Inertial Camera Pose Estimation

Jason R. Rambach*, German Research Center for Artificial Intelligence (DFKI), Augmented Vision Department, Kaiserslautern, Germany.
Aditya Tewari†, German Research Center for Artificial Intelligence (DFKI), Augmented Vision Department, Kaiserslautern, Germany; IEE S.A., Contern, Luxembourg.
Alain Pagani‡, German Research Center for Artificial Intelligence (DFKI), Augmented Vision Department, Kaiserslautern, Germany.
Didier Stricker§, German Research Center for Artificial Intelligence (DFKI), Augmented Vision Department, Kaiserslautern, Germany; TU Kaiserslautern, Germany.

ABSTRACT

Camera pose estimation is the cornerstone of Augmented Reality applications. Pose tracking based exclusively on camera images has been shown to be sensitive to motion blur, occlusions, and illumination changes. Thus, much work has been conducted over the last years on visual-inertial pose tracking, using acceleration and angular velocity measurements from inertial sensors in order to improve the visual tracking. Most proposed systems approach the sensor fusion problem with statistical filtering techniques, which require complex system modelling and calibrations in order to perform adequately. In this work we present a novel approach to sensor fusion, using a deep learning method to learn the relation between camera poses and inertial sensor measurements. A long short-term memory (LSTM) model is trained to provide an estimate of the current pose based on previous poses and inertial measurements. This estimate is then appropriately combined with the output of a visual tracking system using a linear Kalman Filter to provide a robust final pose estimate. Our experimental results confirm the applicability and the tracking performance improvement gained from the proposed sensor fusion system.
Index Terms: I.4.8 [Scene Analysis]: Sensor fusion—tracking; I.2.10 [Vision and Scene Understanding]: Motion—Modeling and recovery of physical attributes; I.2.6 [Artificial Intelligence]: Learning—Connectionism and neural nets

*e-mail: Jason Raphael.Rambach@dfki.de
†e-mail: Aditya.Tewari@dfki.de
‡e-mail: Alain.Pagani@dfki.de
§e-mail: Didier.Stricker@dfki.de

1 INTRODUCTION

Accurate camera pose tracking is a core enabling technology for Augmented Reality (AR) applications using handheld or wearable devices [1]. Precise estimation of the camera's six Degrees of Freedom (6DoF) pose, consisting of camera position and orientation, allows realistic rendering of virtual objects in the observed scene [2]. Vision-based tracking systems using markers or natural features generally perform well in scenarios with slow camera motion [3, 4, 5]. However, in situations where the image quality is compromised, for example during fast camera movements that cause blurring or during sudden illumination changes, pure visual tracking systems tend to fail. On the other hand, pose tracking using inertial sensors (accelerometers and gyroscopes) is more suitable for following fast motion, since these sensors can operate at a much higher frequency, but they usually provide biased measurements with high noise levels. For this reason, there has been a lot of research on sensor fusion pose tracking systems that attempt to combine measurements from visual trackers and inertial sensors in order to achieve more robust tracking [6, 7, 8, 9, 10, 11].

Commonly, statistical filtering approaches such as the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF) or Particle Filters (PF) are used in sensor fusion systems. A tightly coupled fusion system that processes measurements from the visual and inertial sensors in an EKF framework is proposed in [12].
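To illustrate the filtering principle underlying such fusion systems, the following is a minimal one-dimensional linear Kalman filter sketch. It is not the implementation of any of the cited systems; the noise values `q` and `r` and the function name are assumptions chosen for illustration. A control input (standing in for an inertial motion prediction) drives the time update, and a measurement (standing in for a visual pose estimate) corrects it.

```python
import numpy as np

def kf_step(x, P, u, z, q=1e-3, r=1e-2):
    """One predict/update cycle of a scalar linear Kalman filter.

    x, P : current state estimate and its variance
    u    : control input (e.g. an inertial prediction of the motion)
    z    : measurement (e.g. a visual pose estimate)
    q, r : process and measurement noise variances (assumed values)
    """
    # Time update: propagate the state with the control input;
    # uncertainty grows by the process noise q.
    x_pred = x + u
    P_pred = P + q
    # Measurement update: blend prediction and measurement,
    # weighted by the Kalman gain K.
    K = P_pred / (P_pred + r)
    x_new = x_pred + K * (z - x_pred)
    P_new = (1.0 - K) * P_pred
    return x_new, P_new

# One fusion step: the inertial prediction says "moved by 0.5",
# the visual measurement reports 0.52; the filter settles in between.
x, P = kf_step(0.0, 1.0, u=0.5, z=0.52)
```

Tightly coupled systems such as [12] embed the raw measurements in a nonlinear (EKF) version of this cycle, but the predict/correct structure is the same.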
In their work, four previously proposed system models for fusion are compared, some of them using only the gyroscope and others using both inertial sensors. The system model treating the inertial measurements (acceleration and angular velocity) as control inputs to the time update of the EKF is shown to achieve the best performance in terms of tracking accuracy and computational overhead. A simultaneous motion and structure estimation system using sensor fusion is given in [8]. Both the EKF and the UKF were used, showing similar tracking accuracy, with the EKF being much faster in computation time. A marker-based visual tracking system where inertial tracking is deployed as a substitute only when the visual target is occluded is given in [13]. Another loosely coupled fusion approach is presented in [11] and applied to visual-inertial fusion on smartphones. An adaptive Kalman filter with abrupt error detection is used to fuse the outputs of an inertial and a visual tracker; however, only the case of tracking a planar 2D target is considered. Yet another approach is to use the inertial tracking only to provide guidance to the visual tracking system as to where tracked features are expected to be detected [14]. In more recent work, the integration of inertial measurements is done by solving an optimization problem or through variations of the EKF [15, 16]. These advanced methods still employ a parametric inertial sensor error model of bias and Gaussian noise.

Adding inertial measurements to a visual tracking framework requires substantial high-precision preparatory work. A hand-eye calibration, consisting of a rotation and a translation between the camera and the inertial sensors, has to be computed in order to bring the measurements from the visual and inertial sensors into the same reference coordinate system [17, 18, 19]. This calibration was added to the filtering framework state in [20].
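Applying such a hand-eye calibration can be sketched as follows. The rotation `R_ci`, translation `t_ci`, and the measured vector are hypothetical values chosen only to show how a quantity expressed in the IMU frame is mapped into the camera frame; real systems obtain `R_ci` and `t_ci` from a dedicated calibration procedure such as those in [17, 18, 19].

```python
import numpy as np

def imu_to_camera(p_i, R_ci, t_ci):
    # Rigid transform from the IMU frame to the camera frame:
    # p_c = R_ci p_i + t_ci
    return R_ci @ p_i + t_ci

# Hypothetical calibration: 90-degree rotation about the z axis
# and a 2 cm / 1 cm lever arm between the two sensor origins.
theta = np.pi / 2
R_ci = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                 [np.sin(theta),  np.cos(theta), 0.0],
                 [0.0,            0.0,           1.0]])
t_ci = np.array([0.02, 0.0, 0.01])  # metres

# A unit vector measured along the IMU x axis, expressed in the camera frame.
p_c = imu_to_camera(np.array([1.0, 0.0, 0.0]), R_ci, t_ci)
```

Without this transform, inertial and visual measurements refer to different coordinate systems and cannot be fused consistently, which is why the calibration is a prerequisite for any of the fusion schemes above.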
Thus, self-calibration between the camera and the inertial sensor is performed during operation of the tracking system, albeit at the cost of additional complexity in the filtering. Another calibration, from the inertial measurement unit coordinate system to the global coordinate system, has to be computed in order to be able to remove gravity from