Trajectory Servoing: Image-Based Trajectory Tracking Using SLAM

Shiyu Feng 1,†, Zixuan Wu 2,†, Yipu Zhao 3 and Patricio A. Vela 2

Abstract— This paper describes an image-based visual servoing (IBVS) system for a nonholonomic robot to achieve good trajectory following without real-time robot pose information and without a known visual map of the environment. We call it trajectory servoing. The critical component is a feature-based, indirect SLAM method that provides a pool of available features with estimated depth, so that they may be propagated forward in time to generate image feature trajectories for visual servoing. Short- and long-distance experiments show the benefits of trajectory servoing for navigating unknown areas without absolute positioning. Trajectory servoing is shown to be more accurate than pose-based feedback when both rely on the same underlying SLAM system.

I. INTRODUCTION

Navigation systems with real-time needs often employ hierarchical schemes that decompose navigation across multiple spatial and temporal scales. Doing so permits the navigation solution to respond in real-time to novel information gained from sensors, while being guided by the more slowly evolving global path. At the lowest level of the hierarchy lies trajectory tracking to realize the planned paths or synthesized trajectories. In the absence of an absolute reference (such as GPS) and of an accurate map of the environment, there are no external mechanisms to support trajectory tracking. On-board mechanisms include odometry through proprioceptive sensors (wheel encoders, IMUs, etc.) or visual sensors. Pose is not observable from proprioceptive sensors alone, thus visual sensors provide the best mechanism to anchor the robot's pose estimate to external, static position references. Indeed, visual odometry (VO) or visual SLAM (V-SLAM) solutions are essential in these circumstances.
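The feature propagation step named in the abstract can be illustrated with a minimal sketch: a SLAM landmark with estimated depth, expressed in the current camera frame, is carried through a planned relative camera motion and re-projected with a pinhole model to obtain reference image-feature positions. All numeric values (intrinsics, landmark, motion) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Assumed pinhole intrinsics (fx, fy, cx, cy); illustrative only.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

def project(p_cam, K):
    """Pinhole projection of a 3D point in the camera frame to pixels."""
    u = K @ (p_cam / p_cam[2])
    return u[:2]

def propagate_feature(p_cam, R, t):
    """Express the landmark in a future camera frame.

    R, t map coordinates from the current camera frame to the future
    camera frame: p_future = R @ p_cam + t.
    """
    return R @ p_cam + t

# Landmark 5 m ahead of the camera (z forward), slightly left and up.
p = np.array([-0.2, -0.1, 5.0])

# Planned motion: advance 1 m along the optical axis, no rotation,
# so the point moves 1 m closer in the future camera frame.
R = np.eye(3)
t = np.array([0.0, 0.0, -1.0])

u_now = project(p, K)                          # current feature location
u_next = project(propagate_feature(p, R, t), K)  # reference for servoing
```

Repeating the propagation along each pose of a planned trajectory yields the image feature trajectories that the servoing controller then tracks.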
However, they too experience drift, mostly due to the integrated effects of measurement noise. Is it possible to do better? The hypothesis explored in this paper is that performing trajectory tracking in the image domain reduces the sensitivity of trajectory tracking systems reliant on VO or V-SLAM for accuracy. In essence, the trajectory tracking problem is shifted from feedback in pose space to feedback in perception space. Perception space approaches have several favorable properties when used for navigation [1], [2]. Shifting the representation from being world-centric to being viewer-centric reduces computational demands and improves run-time properties. For trajectory tracking without reliable absolute pose information, simplifying the feedback pathway by skipping processes that are not relevant to the local tracking task, or that induce sensitivities in it, may have positive benefits. Using imaging sensors to control motion relative to visual landmarks is known as visual servoing. Thus, the objective is to explore the use of image-based visual servoing for long-distance trajectory tracking with a stereo camera as the primary sensor. The technique, which we call trajectory servoing, will be shown to have improved properties over systems reliant on VO or V-SLAM for pose-based feedback.

*This work supported in part by NSF Award #1849333.
† Equal contribution
1 S. Feng is with the School of Mechanical Engineering and the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30308, USA. shiyufeng@gatech.edu
2 Z. Wu and P.A. Vela are with the School of Electrical and Computer Engineering and the Institute for Robotics and Intelligent Machines, Georgia Institute of Technology, Atlanta, GA 30308, USA. {zwu380, pvela}@gatech.edu
3 Y. Zhao is with Facebook Reality Labs Research, Redmond, USA. {yipuz}@fb.com

A. Related Work

1) Visual Teach and Repeat: Evidence that visual features can support trajectory tracking or consistent navigation through space lies in the Visual Teach and Repeat (VTR) navigation problem in robotics [3], [4]. Given data or recordings of prior paths through an environment, robots can reliably retrace past trajectories. The teaching phase of VTR creates a visual map that contains features associated with robot poses obtained from visual odometry [3], [5]–[8]. Extensions include real-time construction of the VTR data structure during the teaching process, and the maintenance and updating of the VTR data during repeat runs [5], [6]. Feature descriptor improvements make feature matching more robust to environment changes [8], [9]. Visual data in the form of feature points can contain task-relevant and task-irrelevant features, which gives VTR algorithms an opportunity to select the subset that best contributes to the localization or path-following task [5], [7]. While visual map construction seems similar to visual SLAM, map construction is usually not dynamic; because of the separation of the teach and repeat phases, it is difficult to construct or update the visual map in real-time while in motion. In addition, VTR focuses more on local map consistency and does not work toward global pose estimation [7], since the navigation problems it solves are usually defined in the local frame. Another type of VTR uses optical flow [4], [10] or feature sequences [11]–[13] along the trajectory, which are then encoded into a VTR data structure and control algorithm in the teaching phase. Although this method is similar to visual servoing, the system is largely over-determined. It can tolerate feature tracking failure, compared with a traditional visual servoing system, but may lead to discontinuities [14]. Though this method handles long trajectories, and may be supplemented with new teach recordings, it can only track taught trajectories.
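As background on the servoing side of these comparisons, a classical point-feature IBVS law can be reduced to a unicycle-like robot whose inputs are forward speed v (along the camera z-axis) and yaw rate w (about the camera y-axis). The sketch below uses the standard interaction-matrix columns for those two motions and a textbook proportional law u = argmin ||J u + lam*e||; the gain, features, and depths are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ibvs_unicycle(s, s_ref, Z, lam=0.5):
    """Drive normalized image features s toward references s_ref.

    s, s_ref : (N, 2) normalized feature coordinates (x, y)
    Z        : (N,) estimated feature depths (e.g., from stereo SLAM)
    Returns (v, w) as the least-squares solution of J u = -lam * e.
    """
    rows = []
    for (x, y), z in zip(s, Z):
        # Columns of the point-feature interaction matrix for
        # translation along z and rotation about the camera y-axis.
        rows.append([x / z, -(1.0 + x * x)])
        rows.append([y / z, -x * y])
    J = np.array(rows)                 # stacked (2N, 2) Jacobian
    e = (s - s_ref).reshape(-1)        # feature error
    u, *_ = np.linalg.lstsq(J, -lam * e, rcond=None)
    return u
```

With many tracked features the stacked system is over-determined, which is what lets servoing-style controllers tolerate the loss of individual feature tracks, as noted above for flow-based VTR.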
arXiv:2103.00055v2 [cs.RO] 6 Mar 2021