CUDA Accelerated Robot Localization and Mapping

Haiyang Zhang, Fred Martin
Computer Science Department
University of Massachusetts Lowell
Lowell, MA 01854, USA
hzhang@cs.uml.edu, fredm@cs.uml.edu

Abstract — We present a method to accelerate robot localization and mapping using CUDA (Compute Unified Device Architecture), the general-purpose parallel computing platform on NVIDIA GPUs. In robotics, the particle filter-based SLAM (Simultaneous Localization and Mapping) algorithm has many applications but is computationally intensive. Prior work has used CUDA to accelerate various robot applications, but particle filter-based SLAM has not yet been implemented on CUDA. Because the computations on the particles are independent of one another in this algorithm, CUDA acceleration should be highly effective. We have implemented the SLAM algorithm's most time-consuming step, particle weight calculation, and optimized memory access by using texture memory to alleviate the memory bottleneck and fully leverage the GPU's parallel processing power. Our experiments show that performance increases by an order of magnitude or more. The results indicate that offloading computation to the GPU is a cost-effective way to improve SLAM performance.

Keywords — robot; localization; mapping; SLAM; GPU; GPGPU; parallel; CUDA

I. INTRODUCTION

A mobile robot operating in an unknown environment must simultaneously localize itself and generate maps of that environment. The SLAM (Simultaneous Localization and Mapping) algorithm [1, 2] is commonly used in these cases. Based on a probabilistic model, the SLAM algorithm estimates the robot state from its prior state, the current motor commands, and sensor readings. Particle filter-based SLAM is easy to implement and applicable to non-linear and non-Gaussian systems. The particle filter is a sequential Monte Carlo method in which the system state is represented by a set of particles.
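As a concrete sketch of this representation (the structure and field names below are illustrative, not taken from the paper's implementation), a particle set approximates the state distribution with weighted samples, and a point estimate of the robot state is the weight-normalized mean over all particles:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative particle: one hypothetical robot pose plus an importance weight.
struct Particle {
    double x, y, theta;  // hypothetical pose (position and heading)
    double weight;       // importance weight
};

// Weight-normalized mean over the particle set, used as the state estimate.
// Assumes weights are non-negative and not all zero.
Particle estimate(const std::vector<Particle>& particles) {
    Particle mean{0.0, 0.0, 0.0, 0.0};
    double total = 0.0;
    for (const Particle& p : particles) {
        mean.x += p.weight * p.x;
        mean.y += p.weight * p.y;
        mean.theta += p.weight * p.theta;
        total += p.weight;
    }
    mean.x /= total;
    mean.y /= total;
    mean.theta /= total;
    mean.weight = 1.0;
    return mean;
}
```

For example, two equally weighted particles at x = 0 and x = 2 yield an estimated x of 1.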
Each particle is a data object containing one of the hypothetical robot states from the distribution and a "weight" value. In each sensing cycle, we calculate the weight value according to how closely the particle's state matches the current sensor readings, and resample the particle set based on the weights. To maintain an accurate representation of the state distribution, we must use a large number of particles, which makes the particle filter computationally intensive. But most computing steps in the particle filter are performed independently on each particle, so the algorithm is inherently suitable for parallel processing.

CUDA (Compute Unified Device Architecture) [3] is a parallel computing platform running on NVIDIA GPUs (Graphics Processing Units), and one of the most popular GPGPU (General-Purpose computing on GPU) platforms. CUDA includes the compiler and driver needed to build and run CUDA C, an extended C/C++ language that supports code on both the CPU and GPU, as well as communication between them.

The present work extends prior work in this area. We briefly review related research on accelerating particle filters and other robot applications with CUDA.

To efficiently utilize CUDA, Chao et al. describe an algorithm to implement a particle filter on CUDA [4]. Two enhancements are used: Finite-Redraw Importance-Maximizing (FRIM) prior editing and localized resampling. FRIM prior editing increases the coverage of the particles over important regions of the state distribution, and localized resampling reduces the overhead of accessing global memory. They use the bearings-only tracking (BOT) problem for performance benchmarking. These optimizations improve performance by a factor of 5.73 over a direct GPU implementation. This paper shows that CUDA can effectively accelerate the particle filter used in the BOT problem, where resampling is the slowest step.
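The weight-then-resample cycle described above can be sketched as follows. This is a minimal sequential illustration with names and a Gaussian measurement likelihood of our own choosing, not the paper's code; the per-particle independence of the weight loop is what makes it a natural candidate for a CUDA kernel.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative particle: a 1-D pose hypothesis and its importance weight.
struct Particle {
    double x;       // hypothetical pose, one-dimensional for brevity
    double weight;  // importance weight
};

// Weight step: score each particle by how closely its predicted measurement
// matches the actual sensor reading, here via a Gaussian likelihood.
// Each particle is scored independently of the others.
void computeWeights(std::vector<Particle>& particles,
                    double reading, double sigma) {
    for (Particle& p : particles) {
        double err = p.x - reading;  // predicted-vs-observed mismatch
        p.weight = std::exp(-(err * err) / (2.0 * sigma * sigma));
    }
}

// Resample step: draw a new particle set with probability proportional to
// weight (multinomial resampling via the inverse CDF). Uniform draws in
// [0, 1) are passed in explicitly to keep the sketch deterministic.
std::vector<Particle> resample(const std::vector<Particle>& particles,
                               const std::vector<double>& uniforms) {
    double total = 0.0;
    for (const Particle& p : particles) total += p.weight;
    std::vector<Particle> out;
    for (double u : uniforms) {
        double target = u * total;
        double acc = 0.0;
        for (const Particle& p : particles) {
            acc += p.weight;
            if (acc > target) {
                out.push_back({p.x, 1.0});  // copy survivor, reset weight
                break;
            }
        }
    }
    return out;
}
```

After one cycle, particles whose states are consistent with the reading dominate the resampled set, while poorly matching hypotheses die out.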
In our work on the SLAM problem, however, we have found that weight calculation is the most time-consuming step, so we focus on accelerating this part of the particle filter.

Xu et al. present an implementation of the "saliency map model" on CUDA [5]. The saliency map model is a popular computational model in robotic vision for extracting interesting objects from camera inputs, but its computational cost is high and it runs inefficiently on a CPU. Their CUDA-based GPU implementation processes high-speed camera inputs in real time, much faster than a standard CPU implementation. The implementation uses the different memory types in the CUDA memory hierarchy according to the requirements of each part of the algorithm.

GPU computing is used by Tuck et al. to accelerate a mobile robot control system [6]. The map-merging step, which combines laser rangefinder data with stereovision inputs, is slow on a CPU. After porting this and several other steps to the GPU and optimizing with a GPU-targeting compiler, Bacon, overall performance increased to near real time. The computing steps that are parallel in nature, including laser data processing and map merging, are greatly accelerated by GPGPU.

Also, Par and Tosun describe CUDA acceleration for localization based on GPS and map matching [7]. The vehicle location is estimated by a GPS reading first. Then current GPS