A Statistical Analysis of Visual Cues for Estimating Dense Range Maps

Sergio A. Rosales-Morales and Luz A. Torres-Méndez
Robotics and Advanced Manufacturing Group, CINVESTAV Unidad Saltillo
Ramos Arizpe, Coahuila, Mexico.
sergio.rosales; abril.torres@cinvestav.edu.mx

Abstract

A method for recovering dense range maps from sparse range maps through a statistical analysis of visual cues is presented. The proposed technique is aimed at constructing a 3D map of a real environment, which in turn requires visual information that densely covers the environment. The method relies only on the information coming from intensity images taken at the scene in question and, in contrast to existing work, uses a small but representative set of visual cues to estimate scene geometry. The technique first obtains initial (sparse) geometric information from stereo vision. A set of visual characteristics with relevant geometric information is then extracted by statistically analyzing small patches of the data. These characteristics are used to assign confidence values to the sparse range map, and a range synthesis algorithm based on a Markov random field model estimates the complete dense range map. Preliminary experimental results validate the proposed method.

1 Introduction

One of the main goals of computer vision is to recover the geometric structure of objects from images. Surface depth recovery is essential in many applications involving robotics and computer vision. Several important applications (e.g., virtual exploration of remote and hazardous places for security and inspection tasks) require a 2.5D map of the environment; a robot able to build a reliable map is particularly appealing, as these applications depend on the transmission of meaningful visual and geometric information. However, the problem of inferring the underlying structure of a scene from visual images lacks an analytical solution.
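As a rough illustration of the range synthesis step outlined in the abstract, the following toy Python sketch fills unknown depths by copying the depth of the known pixel whose intensity neighborhood is most similar. All names here (`synthesize_range`, the patch size, the SSD matching criterion) are hypothetical; the paper's actual MRF-based formulation is more elaborate than this nearest-neighborhood stand-in.

```python
import numpy as np

def synthesize_range(intensity, depth, pad=1):
    """Toy range synthesis: for each pixel with unknown depth (NaN),
    copy the depth of the known-depth pixel whose intensity
    neighborhood has the smallest sum of squared differences.
    A crude stand-in for MRF-based range synthesis."""
    h, w = intensity.shape
    out = depth.copy()
    # Edge-pad the intensity image so every pixel has a full neighborhood.
    I = np.pad(intensity, pad, mode="edge")
    known = [(r, c) for r in range(h) for c in range(w)
             if not np.isnan(depth[r, c])]
    for r in range(h):
        for c in range(w):
            if np.isnan(out[r, c]):
                nb = I[r:r + 2 * pad + 1, c:c + 2 * pad + 1]
                best, best_d = np.inf, np.nan
                for kr, kc in known:
                    knb = I[kr:kr + 2 * pad + 1, kc:kc + 2 * pad + 1]
                    ssd = np.sum((nb - knb) ** 2)
                    if ssd < best:
                        best, best_d = ssd, depth[kr, kc]
                out[r, c] = best_d
    return out

# Toy example: two intensity regions, each with one known depth sample.
intensity = np.array([[0., 0., 10., 10.],
                      [0., 0., 10., 10.]])
depth = np.array([[1., np.nan, np.nan, 5.],
                  [1., np.nan, np.nan, 5.]])
dense = synthesize_range(intensity, depth)
# Unknown pixels in the dark region inherit depth 1.0; bright region, 5.0.
```

The key design idea this sketch shares with the paper is that intensity structure guides depth propagation: pixels that look alike are assumed to lie on similar surfaces, so depth is transferred between them rather than interpolated blindly.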
Different types of visual cues have been used: the so-called shape-from-X techniques extract depth information from intensity images using cues such as shading, texture, retinal disparity, and motion. These models are traditionally based on physical principles of light interaction. However, because the corresponding inverse problem is highly under-constrained, many assumptions about the type of surface and albedo must be made, which may not all be suitable for complex real scenes. Dense stereo vision gained popularity in the early 1990s due to the large amount of range data it could provide [9, 4]. In mobile robotics, a common setup is one or two cameras mounted on the robot to acquire depth information as the robot moves through the environment. Over the past decade, researchers have developed very good stereo vision systems (see [14] for a review). Although these systems work well in many environments, the cameras must be precisely calibrated for reasonably accurate results. The results are also limited by the baseline distance between the two cameras: depth estimates tend to be inaccurate when the distances considered are large. Moreover, the depth maps generated by stereo under normal scene conditions (i.e., no special textures or structured lighting) suffer from problems inherent in window-based correlation. These problems manifest as imprecisely localized surfaces in 3D space and as hallucinated surfaces that do not in fact exist. Recently, there has been much progress on using learning approaches in stereo vision. One of the top-performing methods is the work by Zhang and Seitz [22], who iteratively estimate the global parameters of an MRF stereo model from the previous disparity estimates, without having to rely on ground-truth data. In more recent work, Saxena et al.
[13] incorporate monocular cues from a single image into a stereo system, modeling depths and the relationships between depths at different points in the image using a hierarchical, multi-scale MRF. A training set, comprising a large set of stereo pairs and their corresponding ground-truth depth maps, is used to model the posterior distribution of the depths given the monocular image features and the disparities. Saxena et al. [12] applied supervised learning to the problem of estimating depth from single monocular cues on images of unconstrained outdoor and indoor environments. This task is difficult since it requires a significant amount of

2008 Seventh Mexican International Conference on Artificial Intelligence. 978-0-7695-3441-1/08 $25.00 © 2008 IEEE. DOI 10.1109/MICAI.2008.57