0018-9545 (c) 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TVT.2019.2954876, IEEE Transactions on Vehicular Technology

Abstract—The technology for simultaneous localization and mapping (SLAM) has been well investigated with the rising interest in autonomous driving. Visual odometry (VO) is a variation of SLAM without global consistency that estimates the position and orientation of a moving object by analyzing the image sequences captured by the associated cameras. However, in real-world applications, the VO process inevitably suffers from the drift error problem due to frame-by-frame pose estimation. The drift can be more severe for monocular VO than for stereo matching. By jointly refining the camera poses over several local keyframes and the coordinates of the 3D map points triangulated from extracted features, bundle adjustment (BA) can mitigate the drift error problem only to some extent. To further improve the performance, we introduce a traffic sign feature-based joint BA module to relieve the incrementally accumulated pose errors. The continuously extracted traffic sign features, with their standard size and planar geometry, provide powerful additional constraints for improving the VO estimation accuracy through BA. Our framework collaborates well with existing VO systems, e.g., ORB-SLAM2, and the traffic sign features can also be replaced with features extracted from other size-known planar objects. Experimental results show that applying our traffic sign feature-based BA module improves vehicular localization accuracy compared with the state-of-the-art baseline VO method.
Index Terms—Monocular Visual Odometry (VO), Bundle Adjustment (BA), traffic sign, ORB-SLAM2.

BUNDLE ADJUSTMENT FOR MONOCULAR VISUAL ODOMETRY BASED ON DETECTIONS OF TRAFFIC SIGNS

Yanting Zhang, Student Member, IEEE, Haotian Zhang, Student Member, IEEE, Gaoang Wang, Student Member, IEEE, Jie Yang, Member, IEEE, and Jenq-Neng Hwang, Fellow, IEEE

Manuscript received June 30, 2019; revised September 12, 2019; accepted November 14, 2019. This work was supported in part by the China Scholarship Council. Copyright (c) 2019 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

Yanting Zhang is with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876 (e-mail: zhangyt@bupt.edu.cn). She was a visiting Ph.D. student with the Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, from 2018/09 to 2019/09.

Jie Yang is with the School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876 (e-mail: janeyang@bupt.edu.cn).

Haotian Zhang, Gaoang Wang, and Jenq-Neng Hwang are with the Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195 (e-mail: {haotiz, gaoang, hwang}@uw.edu).

I. INTRODUCTION

In autonomous driving, it is essential to know where the vehicle itself is and to perceive the surrounding area. Benefiting from the rapid development of networking and communication technologies, big-data information exchange among vehicles is becoming feasible [1], [2], [3]. However, the precondition of this information exchange is that the vehicle has a good understanding of its own condition, especially a precise self-localization, since purely relying on the Global Positioning System (GPS) is no longer sufficient for autonomous scenarios [4]. Real applications of autonomous driving require a much more precise measurement or an alternative localization system. The concept of simultaneous localization and mapping (SLAM) was proposed to provide such a solution [5], [6]. It offers accurate localization, which is the most critical step for path planning and motion control. Moreover, the estimated camera poses can further provide useful camera extrinsic parameters (rotation and translation) for the 3D localization and tracking of visually detected objects. Different techniques and sensors can be used or combined to achieve better localization in SLAM [7], [8], [9]. Visual cameras (monocular, stereo, and depth cameras), lidar scans, and inertial measurement units (IMUs), together with detection and tracking techniques [10], can be integrated into a unified system. In this paper, we focus on monocular vision-based vehicular localization, since monocular cameras are more common and much more affordable for real-world applications. Moreover, facilitated by deep learning based techniques, navigation and obstacle avoidance can also be performed effectively through visual information alone [11].

Visual odometry (VO) describes the process of estimating the ego-motion of a vehicle based on pairs of consecutive image frames captured by the attached cameras [12], [13]. Besides autonomous driving, VO is also an important and fundamental component of numerous emerging technologies, such as robotic navigation and virtual/augmented reality [14]. It analyzes the captured image sequences frame by frame to return the camera pose parameters of each frame, including rotation and translation information. As a representative feature-based method, ORB-SLAM2