Cloud-Based Realtime Robotic Visual SLAM

Patrick Benavidez, Mohan Muppidi, Paul Rad, John J. Prevost, Ph.D., and Mo Jamshidi, Ph.D. (Lutcher Brown Endowed Chair Professor)
Autonomous Control Engineering Lab, Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, USA
patrick.benavidez@gmail.com, [mohan.muppidi, paul.rad, jeff.prevost]@utsa.edu, moj@wacong.org

Abstract—Prior work has shown that Visual SLAM (VSLAM) algorithms can successfully be used for realtime processing on local robots. As data processing requirements increase, due to image size or robot velocity constraints, local processing may no longer be practical. Offloading VSLAM processing to systems running in a cloud deployment of Robot Operating System (ROS) is proposed as a method for managing these increasing processing constraints. The traditional bottleneck in VSLAM is performing feature identification and matching across a large database. In this paper, we present a system and algorithms that reduce the computational time and storage requirements of the feature identification and matching components of VSLAM by offloading the processing to a cloud comprised of a cluster of compute nodes. We compare this new approach to our prior approach, in which only the local resources of the robot were used, and examine the increase in throughput made possible by the new processing architecture.

Keywords—cloud, cooperative VSLAM, indoor robot, VSLAM

I. INTRODUCTION

There are many approaches to robot navigation. The Global Positioning System (GPS) is most often the approach of choice when the robot is operating in a theatre where a GPS signal is present. Often, however, using GPS for robot navigation and localization is not possible due to lack of signal. This occurs when the robot is operating inside a structure, or building, that blocks the reception of the GPS signals.
In these situations, algorithms such as Visual Simultaneous Localization and Mapping (VSLAM) allow robots to track and maintain local maps of their relative positions within their environment. VSLAM works by using a camera mounted on the robot to periodically capture images of the immediate surroundings and extract key features from them. The robot determines where it is in the local environment by comparing these features against a database of images taken of the environment during prior passes by the robot.

Many algorithms, such as the Scale-Invariant Feature Transform (SIFT) [1], Speeded-Up Robust Features (SURF) [2], Features from Accelerated Segment Test (FAST) [3], and Oriented FAST and Rotated BRIEF (ORB) [4], are typically used for feature keypoint detection. Each of these algorithms is capable of detecting multiple robust features for use in VSLAM. Typically, VSLAM using these feature detection algorithms requires the storage of hundreds, or possibly thousands, of images to properly ascertain the location of a robot in a local environment. This creates processing difficulties for the robot because the key features must be extracted and then compared with the images in the database as a realtime operation.

The rest of this paper is organized as follows. Section II presents our survey of existing approaches. Section III covers the proposed algorithm and computing model. Section IV presents comparative results for our algorithm versus existing well-known algorithms. Conclusions are presented in Section V.

II. BACKGROUND

Processing speed was examined in prior work by the authors [5], in which a mechanism for meeting processing constraints by limiting the image size at runtime was developed. The proposed processing algorithms were experimentally tested and shown to allow proper image recognition within realtime operating constraints. Fuentes-Pacheco et al. [6] discussed the problem of dynamic environments in SLAM.
The authors noted the importance of having reliable algorithms that perform well under conditions such as variable lighting, occlusions, featureless regions, and other unforeseen situations. Intensity variations are particularly problematic indoors, where many light sources can project uneven intensities onto the captured scene. If a feature detector does not work well under varying lighting conditions, mismatches will occur and the resulting pose error will be high.
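The database-matching step identified above as the traditional VSLAM bottleneck can be illustrated with a minimal brute-force matcher over binary descriptors (the form ORB produces). This is only a sketch, not the implementation evaluated in this paper: the 256-bit descriptor length, the Lowe-style ratio threshold, and the synthetic data are illustrative assumptions.

```python
import random

DESC_BITS = 256  # ORB-style binary descriptor length (assumption)


def hamming(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(query, database, ratio=0.8):
    """Brute-force nearest-neighbor matching with a ratio test.

    Every query descriptor is compared against the entire database,
    an O(len(query) * len(database)) scan. This is the cost that grows
    with the image database and motivates offloading to cloud nodes.
    """
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, d), di) for di, d in enumerate(database))
        best, second = dists[0], dists[1]
        # Accept only unambiguous matches: best distance must be clearly
        # smaller than the runner-up's distance.
        if best[0] < ratio * second[0]:
            matches.append((qi, best[1], best[0]))
    return matches


# Tiny synthetic example: a database of random descriptors plus a query
# that is a near-duplicate of entry 42 (two bits flipped, i.e. noise).
random.seed(0)
db = [random.getrandbits(DESC_BITS) for _ in range(1000)]
q = db[42] ^ 0b11
result = match_descriptors([q], db)  # expected to recover index 42
```

Splitting `database` into shards, one per compute node, parallelizes exactly this inner scan, which is the essence of offloading the matching workload to a cluster.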