A Statistical Analysis of Visual Cues for Estimating Dense Range Maps
Sergio A. Rosales-Morales and Luz A. Torres-Méndez
Robotics and Advanced Manufacturing Group, CINVESTAV Unidad Saltillo
Ramos Arizpe, Coahuila, Mexico.
{sergio.rosales, abril.torres}@cinvestav.edu.mx
Abstract
A method for recovering dense range maps from sparse range maps through a statistical analysis of visual cues is presented. The proposed technique is motivated by the construction of 3D maps of real environments, which requires visual information that densely covers the environment to be mapped. The method relies only on intensity images taken at the scene in question and, in contrast to existing work, uses a small but representative set of visual cues to estimate scene geometry. The technique first obtains initial (sparse) geometric information from stereo vision. A set of visual characteristics carrying relevant geometric information is then extracted by statistically analyzing small patches of the data. These characteristics are used to assign confidence values to the sparse range map, and a range synthesis algorithm based on a Markov random field model estimates the complete dense range map. Preliminary experimental results validate the proposed method.
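The range-synthesis step summarized above can be illustrated in a simple non-parametric form. The following is a minimal sketch, not the authors' implementation: it assumes the intensity image and sparse range map are NumPy arrays of the same shape, with unknown depths marked as NaN, and it omits the confidence-weighting step. Each unknown pixel copies the range of the known pixel whose surrounding intensity neighborhood is most similar, in the spirit of MRF-based texture synthesis.

```python
import numpy as np

def synthesize_range(intensity, sparse_range, patch=3):
    """Fill unknown (NaN) range values by non-parametric MRF-style
    synthesis: each unknown pixel copies the range of the known pixel
    whose surrounding intensity neighborhood is most similar (SSD)."""
    h, w = intensity.shape
    r = patch // 2
    out = sparse_range.copy()
    interior = [(i, j) for i in range(r, h - r) for j in range(r, w - r)]
    known = [p for p in interior if not np.isnan(out[p])]
    unknown = [p for p in interior if np.isnan(out[p])]
    for (i, j) in unknown:
        nb = intensity[i - r:i + r + 1, j - r:j + r + 1]
        best, best_d = None, np.inf
        for (ki, kj) in known:
            kb = intensity[ki - r:ki + r + 1, kj - r:kj + r + 1]
            d = np.sum((nb - kb) ** 2)
            if d < best_d:
                best, best_d = (ki, kj), d
        out[i, j] = out[best]
    return out

# Toy example: depth is perfectly predicted by intensity (depth = 2 * I),
# so the copied ranges recover the missing values exactly.
intensity = np.tile(np.arange(8.0), (8, 1))
truth = 2.0 * intensity
sparse = truth.copy()
sparse[3, 3] = np.nan   # simulate pixels where stereo found no match
sparse[4, 4] = np.nan
dense = synthesize_range(intensity, sparse)
```

In the full method, the confidence values attached to the sparse map would additionally weight which known samples are trusted during synthesis.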
1 Introduction
One of the main goals of computer vision is to recover
the geometric structure of objects from images. Surface
depth recovery is essential in multiple applications involv-
ing robotics and computer vision. Several important applications (e.g., the virtual exploration of remote and hazardous places for security and inspection tasks) require a 2.5D map of the environment; a robot able to build such a map reliably is particularly appealing, as these applications depend on the transmission of meaningful visual and geometric information. However, the problem of inferring the underlying structure from visual images lacks an analytical solution.
Different types of visual cues have been used: the so-called shape-from-X techniques extract depth information from intensity images by using cues such as shading, texture, retinal disparity, and motion. These models are traditionally based on physical principles of light interaction. However, because the corresponding inverse problem is highly under-constrained, many assumptions about the type of surface and albedo need to be made, which may not all be suitable for complex real scenes. Dense stereo vision gained popularity in the early 1990s due to the large amount of range data it could provide [9, 4]. In mobile robotics, a common setup
is the use of one or two cameras mounted on the robot to
acquire depth information as the robot moves through the
environment. Over the past decade, researchers have de-
veloped very good stereo vision systems (see [14] for a re-
view). Although these systems work well in many envi-
ronments, the cameras must be precisely calibrated for reasonably accurate results. Also, the results are limited by
the baseline distance between the two cameras. The depth
estimates tend to be inaccurate when the distances consid-
ered are large. The depth maps generated by stereo under
normal scene conditions (i.e., no special textures or struc-
tured lighting) suffer from problems inherent in window-
based correlation. These problems manifest as imprecisely
localized surfaces in 3D space and as hallucinated surfaces
that in fact do not exist. Recently, there has been much
progress on using learning approaches in stereo vision. One
of the top-performing methods for stereo vision is the work
by Zhang and Seitz [22], who iteratively estimate the global
parameters of an MRF stereo model from the previous disparity estimates, without having to rely on ground-truth data. In a
more recent work, Saxena et al. [13] incorporate monocular
cues from a single image into a stereo system for modeling
depths and relationships between depths at different points
in the image using a hierarchical, multi-scale MRF. A train-
ing set, comprising a large set of stereo pairs and their cor-
responding ground-truth depth maps, is used to model the
posterior distribution of the depths given the monocular im-
age features and the disparities.
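The baseline limitation discussed above can be quantified: for an ideal pinhole stereo rig, triangulation gives Z = fB/d, so a fixed disparity error δd produces a depth error that grows quadratically with distance. The following numerical sketch uses illustrative focal-length and baseline values, not parameters from any of the cited systems:

```python
# Stereo triangulation: Z = f * B / d, with f the focal length (pixels),
# B the baseline (metres) and d the disparity (pixels).
# Propagating a disparity error delta_d gives |dZ| ~= Z**2 * delta_d / (f * B),
# i.e. depth error grows quadratically with distance.
f = 700.0      # focal length in pixels (assumed value)
B = 0.12       # baseline in metres (assumed value)
delta_d = 1.0  # one-pixel matching error

def depth_error(z):
    """Approximate depth uncertainty at true depth z (metres)
    for a disparity error of delta_d pixels."""
    return z ** 2 * delta_d / (f * B)

print(depth_error(2.0))   # a few centimetres at 2 m
print(depth_error(20.0))  # several metres at 20 m: stereo degrades at range
```

This quadratic degradation is why the monocular and learned cues described above are attractive complements to raw disparity.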
Saxena et al. [12] applied supervised learning to the
problem of estimating depth from single monocular cues on
images of unconstrained outdoor and indoor environments.
This task is difficult since it requires a significant amount of
2008 Seventh Mexican International Conference on Artificial Intelligence
978-0-7695-3441-1/08 $25.00 © 2008 IEEE
DOI 10.1109/MICAI.2008.57