First Steps towards Stereo-based 6DOF SLAM for the Visually Impaired

Juan Manuel Sáez, Francisco Escolano, Antonio Peñalver
Robot Vision Group, Departamento de Ciencia de la Computación e Inteligencia Artificial
Universidad de Alicante, Ap.99, E-03080, Alicante, Spain
{jmsaez,sco,apenalver}@dccia.ua.es

Abstract

In this work we embody a full 6DOF SLAM solution in a wearable stereo device working in near real-time. In order to serve on-line metric (map) and positional (localization) information to the blind or visually impaired, we introduce three basic elements: (i) a real-time egomotion estimation integrating 3D and 2D (appearance) information; (ii) a randomized algorithm for global rectification by entropy minimization; and (iii) a time-dependent view-integration strategy for reducing the number of variables and achieving near real-time performance in global rectification. We show successful experimental results which illustrate the capabilities of both the egomotion and the global-rectification algorithms in terms of supporting 6DOF, yielding statistical robustness, and enabling scalable solutions.

1. Introduction

The Simultaneous Localization and Mapping (SLAM) problem is the estimation of both the relative positions of the environmental features with respect to the observer and the position of the observer itself. SLAM has been recognized as a fundamental problem in robotics because its solution is key to endowing robots with truly autonomous capabilities. However, SLAM has also captured the attention of computer vision researchers interested in building useful wearable devices (camera + computer/PDA) for serving on-line metric (map) and positional (localization) information to the observer, which is key in the case of blind and visually impaired people. Considering the broad spectrum of SLAM approaches, there are two extremes depending on whether we aim to build a dense map of the environment or not.
In the first case, when the sensors used yield data dense enough for building 3D maps of the environment (typically laser range finders [24][11][13][7][23]), the problem of simultaneously computing the map and the observer poses is formulated in terms of maximizing a log-likelihood function and then using an EM-like algorithm to obtain, at least, a local maximum. In some of these algorithms, the core of SLAM is the registration of successive point clouds by means of an Iterative Closest Point (ICP) [2][17] algorithm, embodied in a computational scheme which enforces a global-consistency criterion [9].

The other extreme of the SLAM spectrum is characterized by algorithms obtaining sparse maps, that is, only a few features are tracked and used to compute the position of the observer. In this latter case, an Extended Kalman Filter (EKF) is typically used, and simple data-association mechanisms solve the registration problem between features in consecutive views. The two fundamental weaknesses of this approach (a complexity quadratic in the number of features, and the collapse of the filter when data association fails) are partially solved in the recent FastSLAM approach [15][8][25]. Although the sensors used in these cases are typically (laser) range sensors, the sparse approach has recently been brought to the computer vision arena [5][4]. However, in these latter algorithms, if one wants real-time solutions, the registration (structure-from-motion) problem must be circumvented, because solving it accurately implies intensive batch (off-line) processing. Consequently, only short sequences can be tracked without errors. However, there are recent experiments embodying structure-from-motion in the filter [14].

Computer vision solutions are desirable because cameras (especially monocular ones) are cheaper than 2D (and obviously 3D) lasers.
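As a concrete illustration of the registration step mentioned above (a generic sketch, not the specific ICP variants of [2][17]), the following Python/NumPy fragment implements a naive point-to-point ICP loop: nearest-neighbour data association followed by a closed-form SVD (Kabsch) estimate of the rigid motion. The function names are ours, and real systems would use a k-d tree for the association step and outlier rejection for robustness.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares rotation R and translation t mapping P onto Q
    (both N x 3, rows already in correspondence), via SVD (Kabsch)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                   # proper rotation (det = +1)
    t = cq - R @ cp
    return R, t

def icp(P, Q, iters=30):
    """Register point cloud P to Q; returns accumulated (R, t)."""
    R_tot, t_tot = np.eye(3), np.zeros(3)
    for _ in range(iters):
        # data association: closest point in Q for every point in P
        d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(axis=2)
        matches = Q[d2.argmin(axis=1)]
        # incremental motion from the current correspondences
        R, t = best_rigid_transform(P, matches)
        P = P @ R.T + t
        R_tot, t_tot = R @ R_tot, R @ t_tot + t
    return R_tot, t_tot
```

The quadratic-cost distance matrix is what makes this a toy: it is exact but scales as O(N^2), which is why practical dense-SLAM pipelines replace it with approximate nearest-neighbour structures.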
However, in the middle of the SLAM spectrum we may find stereo-based solutions which compute semi-sparse 3D maps. In [16], where 3D information yields 2D maps, these maps are represented with occupancy grid models. In other cases, like in [10], stereo data even allow the computation of 3D planes, although some manual guidance is necessary when there is no data. In [12], stereo vision is fused with inertial information in order to recover 3D segments. On the other hand, in [18][19], 3D landmarks based on scale-invariant image features are used to compute the map. Such a computation relies on estimating the ego-motion of the robot, tracking the landmarks using