The Impact of the Image Feature Detector and Descriptor Choice on Visual Odometry Accuracy

Chunkai Yao 1, Danish Syed 1, Joseph Kim 2, Peerayos Pongsachai 2, and Teerachart Soratana 3

Abstract—Building a fully autonomous mobile robotic system is a difficult task that requires accurate sensors and techniques for localization and mapping. One such technique is Visual Odometry (VO), which uses stereo or monocular camera sensors to estimate the poses of a vehicle. The goal of this project was to implement VO and to analyze localization accuracy for different combinations of descriptor and detector. Simulation results were obtained by analyzing 10 KITTI dataset sequences, with Relative Pose Error (RPE), Absolute Trajectory Error (ATE), and runtime in terms of frames per second (FPS) as evaluation metrics. RPE and ATE for each combination of descriptor and detector are compared as heatmaps. Outliers in RPE and error accumulation in ATE are discussed, and future work is suggested toward a conclusive benchmark analysis. The code for our report is publicly available at https://github.com/dysdsyd/VO_benchmark.

Index Terms—Visual Odometry, SLAM, Mobile robotics, Computer vision, Descriptors, Detectors, Convolutional Neural Network

I. INTRODUCTION

Significant progress has been achieved in the area of mobile robotics. Advancements in both hardware and software have made such systems a safe and reliable technology. For such systems, making robots understand the world and recognize and distinguish what they see has become an important research area. Furthermore, localization and mapping have become central technologies in the perception and state estimation of mobile robotics. In this paper, we recap relevant techniques used in monocular visual odometry (VO) to determine the position and orientation of a vehicle using the pre-existing KITTI dataset [1].
We then implemented different feature detectors and descriptors and compared their performance using Relative Pose Error (RPE), Absolute Trajectory Error (ATE), and runtime. The rest of this paper is organized as follows: Section II discusses existing techniques in VO as well as detector and descriptor algorithms. Section III discusses the details of the techniques we implemented for performance evaluation. Section IV discusses the results of the evaluation. Section V summarizes the conclusions of the performance evaluation and discusses what we can improve in the future.

1 Chunkai Yao and Danish Syed are with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA nickyao@umich.edu and dasyed@umich.edu
2 Joseph Kim and Peerayos Pongsachai are with the Robotics Institute, University of Michigan, Ann Arbor, USA jthkim@umich.edu and ppongsa@umich.edu
3 Teerachart Soratana is with the Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, USA tsorat@umich.edu

II. BACKGROUND

For a robot to autonomously track its path and detect and avoid obstacles, localization is the fundamental technique for pose estimation. Below we describe VO, along with the detector and descriptor algorithms used for data acquisition and transformation estimation.

A. Visual Odometry (VO)

VO is a pose estimation technique for a moving agent that uses camera video input [2]. By analyzing the sequence of camera images, VO incrementally estimates the pose (translation and rotation with respect to a reference frame) of a vehicle. The idea first appeared in the early 1980s [3], and VO was extensively researched at NASA in preparation for the 2004 Mars mission [4], [5]. They found that VO is an inexpensive technique with several distinct advantages: it functions in GPS-denied environments, it does not suffer from wheel slippage on uneven terrain, and it is lightweight and simple to integrate with other computer-vision-based algorithms.
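To make the incremental estimation concrete, the sketch below chains per-frame relative transforms into a global trajectory and scores it with a translation-only ATE, one of the metrics we report. This is a minimal illustration using 4x4 homogeneous transforms; the function names are ours for exposition and are not taken from our benchmark code, which also handles RPE and the full KITTI evaluation protocol.

```python
import numpy as np

def accumulate_poses(rel_transforms):
    """Chain 4x4 relative transforms (frame k-1 -> frame k) into global poses."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for T in rel_transforms:
        pose = pose @ T          # compose the new relative motion onto the pose
        trajectory.append(pose.copy())
    return np.stack(trajectory)  # shape: (num_frames, 4, 4)

def ate_rmse(est, gt):
    """Translation-only Absolute Trajectory Error: RMSE of position differences."""
    diffs = est[:, :3, 3] - gt[:, :3, 3]
    return float(np.sqrt(np.mean(np.sum(diffs ** 2, axis=1))))
```

Because each frame's pose is composed from all previous relative estimates, a small per-frame error propagates through every later pose, which is why ATE tends to accumulate drift while RPE isolates per-frame accuracy.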
Thus, VO has become an alternative to conventional localization methods such as wheel odometry, GPS, INS, and sonar localization [5]. In particular, VO is among the more robust localization techniques, known to achieve a relative position error of 0.1 to 2% [5]. For comparison, Table I below briefly summarizes the pros and cons of different localization techniques.

Furthermore, it is worth distinguishing between VO and Simultaneous Localization and Mapping (SLAM). In VO, the primary objective is to estimate the pose of the vehicle by incrementally taking camera input and maintaining local consistency between sequential frames. In contrast, the objective of SLAM is to incrementally update a map of an unknown environment and use this updated map to estimate the pose of the vehicle within it. In this sense, SLAM maintains a globally consistent estimate of the trajectory. Figure 1 below illustrates a high-level overview of VO and SLAM.

VO is broadly classified into two categories (stereo vision and monocular vision) based on the type of camera used in position estimation. Given our interest in VO with image feature detectors, we restrict our scope to monocular VO without loss of generality. However, it is worth noting the main performance differences between stereo and monocular cameras used in VO. With a stereo camera, depth information and image scale are relatively easy to obtain, but stereo setups require more calibration effort than monocular cameras and can be more costly and difficult to interface. In contrast, a monocular camera is low cost, low weight, and simpler to implement,