The Battle for Filter Supremacy: A Comparative Study of the Multi-State Constraint Kalman Filter and the Sliding Window Filter

Lee E Clement, Valentin Peretroukhin, Jacob Lambert, and Jonathan Kelly

Abstract— Accurate and consistent egomotion estimation is a critical component of autonomous navigation. For this task, the combination of visual and inertial sensors is an inexpensive, compact, and complementary hardware suite that can be used on many types of vehicles. In this work, we compare two modern approaches to egomotion estimation: the Multi-State Constraint Kalman Filter (MSCKF) and the Sliding Window Filter (SWF). Both filters use an Inertial Measurement Unit (IMU) to estimate the motion of a vehicle and then correct this estimate with observations of salient features from a monocular camera. While the SWF estimates feature positions as part of the filter state itself, the MSCKF optimizes feature positions in a separate procedure without including them in the filter state. We present experimental characterizations and comparisons of the MSCKF and SWF on data from a moving hand-held sensor rig, as well as several traverses from the KITTI dataset. In particular, we compare the accuracy and consistency of the two filters, and analyze the effect of feature track length and feature density on the performance of each filter. In general, our results show the SWF to be more accurate and less sensitive to tuning parameters than the MSCKF. However, the MSCKF is computationally cheaper, has good consistency properties, and improves in accuracy as more features are tracked.

I. INTRODUCTION

The combination of visual and inertial sensors is a powerful tool for autonomous navigation in unknown environments. Indeed, cameras and inertial measurement units (IMUs) are complementary in several respects. Since an IMU measures linear accelerations and rotational velocities, these values must be integrated to arrive at a new pose estimate.
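To make this integration step concrete, the following sketch (not part of the original paper) propagates a position, velocity, and orientation through one Euler step of IMU dead reckoning; the function names, gravity convention, and noise levels in the demo are illustrative assumptions.

```python
import numpy as np

def so3_exp(phi):
    """Rodrigues' formula: map a rotation vector to a rotation matrix."""
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        return np.eye(3)
    a = phi / theta
    A = np.array([[0.0, -a[2], a[1]],
                  [a[2], 0.0, -a[0]],
                  [-a[1], a[0], 0.0]])          # skew-symmetric matrix of a
    return np.eye(3) + np.sin(theta) * A + (1.0 - np.cos(theta)) * (A @ A)

def propagate(p, v, C, accel, omega, dt, g=np.array([0.0, 0.0, -9.81])):
    """One Euler step of IMU dead reckoning.

    p, v  -- position and velocity in the world frame
    C     -- rotation matrix from body frame to world frame
    accel -- specific force measured in the body frame
    omega -- rotational velocity measured in the body frame
    """
    a_world = C @ accel + g                     # remove gravity in the world frame
    p_new = p + v * dt + 0.5 * a_world * dt ** 2
    v_new = v + a_world * dt
    C_new = C @ so3_exp(omega * dt)             # integrate the rotation
    return p_new, v_new, C_new

# A stationary IMU ideally measures only the reaction to gravity, but even
# small measurement noise integrates into position drift with no true motion.
rng = np.random.default_rng(42)
p, v, C = np.zeros(3), np.zeros(3), np.eye(3)
for _ in range(2000):                           # 20 s at 100 Hz
    a_meas = np.array([0.0, 0.0, 9.81]) + rng.normal(0.0, 0.05, 3)
    w_meas = rng.normal(0.0, 0.01, 3)
    p, v, C = propagate(p, v, C, a_meas, w_meas, 0.01)
print(np.linalg.norm(p))                        # nonzero drift, despite no true motion
```

With noise-free measurements the same loop returns the pose exactly; the drift in the noisy case grows without bound as the integration horizon lengthens, which is what the camera corrections discussed below are meant to arrest.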
However, the noise inherent in the IMU's measurements is included in the integration as well, and consequently the pose estimates can drift without bound over time. The addition of a camera is an excellent way to bound this cumulative drift error because the camera's signal-to-noise ratio is highest when the camera is moving slowly. On the other hand, cameras are not robust to motion blur induced by rapid motions. In these cases, IMU data can be relied upon more heavily when estimating egomotion.

The question, then, is how best to fuse measurements from these two sensor types to arrive at an accurate estimate of a vehicle's motion over time. This problem is often complicated by the absence of a known map of features from which the camera can generate measurements. Any solution must therefore solve a Simultaneous Localization and Mapping (SLAM) problem, although the importance placed on mapping may vary from algorithm to algorithm.

Lee E Clement and Valentin Peretroukhin jointly assert first authorship. All authors are at the Institute for Aerospace Studies, University of Toronto, Canada. {lee.clement, v.peretroukhin, jacob.lambert}@mail.utoronto.ca, jkelly@utias.utoronto.ca

Fig. 1. The hand-held sensor head used in our experiments with the "Starry Night" dataset. The IMU reports translational and rotational velocities, while the stereo camera observes point features. Since we are comparing monocular algorithms, we used measurements from the left camera of the stereo pair only.

In this work we characterize, compare, and contrast the performance of two modern solutions to the visual-inertial SLAM problem, the Sliding Window Filter (SWF) and the Multi-State Constraint Kalman Filter (MSCKF) [1], [2], on data from a moving hand-held sensor rig, as well as several traverses from the KITTI dataset [3]. The most similar work to ours is that of Leutenegger et al.
[4], which compares the accuracy of the MSCKF to a keyframe-based SWF on datasets consisting of relatively planar motion through urban and indoor environments. In contrast to [4], our SWF optimizes over a constant number of timesteps rather than keyframes. We also conduct a more extensive characterization of the sensitivity of the MSCKF to certain tuning parameters and compare both algorithms using data from a hand-held sensor rig that mimics the more arbitrary motion of a micro aerial vehicle (MAV).

II. BACKGROUND

Visual-inertial navigation systems (VINS) have been applied broadly in robotics [2], [5]–[10], and there is a considerable body of work covering a wide range of estimation algorithms for the camera-IMU sensor pair. These techniques are often characterized as either loosely coupled or tightly coupled. In loosely coupled systems, image and IMU measurements are processed individually before being fused into a single estimate, while tightly coupled systems process all information together. The decoupling of inertial and visual measurements in loosely coupled systems limits computational complexity [4], but at the cost of information: