Reward-based learning of optimal cue integration in audio and visual
depth estimation
Cem Karaoguz 1,2,*, Thomas H. Weisswange 3,*, Tobias Rodemann 2,*, Britta Wrede 1 and Constantin A. Rothkopf 3
Abstract— Many real-world applications in robotics have to
deal with imprecisions and noise when using only a single
information source for computation. Therefore, making use
of additional cues or sensors is often the method of choice.
One example considered in this paper is depth estimation,
where multiple visual and auditory cues can be combined to
increase precision and robustness of the final estimates. Rather
than using a weighted average of the individual estimates, we
use a reward-based learning scheme to adapt to the given
relations amongst the cues. This approach has been shown
before to mimic the development of near-optimal cue integration
in infants and benefits from using few assumptions about
the distribution of inputs. We demonstrate that this approach
can substantially improve performance in two different depth
estimation systems, one auditory and one visual.
I. INTRODUCTION
The combination of different cues to improve the per-
formance in tasks like segmentation [1], [2], object iden-
tification [3], or object tracking [4] is a common method
in robotics. Merging different cues with complementary or
partially redundant characteristics has a good potential to
improve both precision and robustness (for example, reducing
the mean and maximum estimation error). The optimal integration
is theoretically well defined and straightforward in a
Bayesian framework. However, for real world applications,
performing this computation is usually intractable. A com-
monly used approximation is a weighted sum of the maxi-
mum likelihood estimates of the different cues [5]. Such an
approach is computationally efficient but is only guaranteed
to be close to the optimal solution for the idealized case
of cues with uncorrelated Gaussian noise and knowledge of
error variances and potential biases. Unfortunately, for many
practical applications these conditions are not necessarily
met. Additionally, fixed weights can lead to problems when
the environment changes. An adaptive approach has been
presented in [4], where an optimal weighting of different
cues in an object tracking task is learned online.
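The weighted-sum approximation above has a simple closed form under the idealized assumption of unbiased cues with uncorrelated Gaussian noise: each maximum likelihood estimate is weighted by its inverse error variance. A minimal sketch (the cue values and variances below are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_ml_estimates(estimates, variances):
    """Combine per-cue ML estimates by inverse-variance weighting.

    Optimal only for unbiased cues with uncorrelated Gaussian noise.
    Returns the fused estimate and its (reduced) variance.
    """
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()   # normalize to sum to 1
    fused = float(np.dot(weights, estimates))
    fused_var = 1.0 / precisions.sum()        # always <= min variance
    return fused, fused_var

# Two depth cues: 2.0 m (variance 0.04) and 2.6 m (variance 0.16)
depth, var = fuse_ml_estimates([2.0, 2.6], [0.04, 0.16])
```

Note that the fused variance is always smaller than the best single-cue variance; the approximation degrades exactly when the conditions named above (uncorrelated noise, known variances, no biases) are violated.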
In this work we present a different approach that uses a
general reward-based learning scheme for training a neural
network to combine depth estimations from multiple cues.
1 Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, 33594 Bielefeld, Germany ckaraogu@cor-lab.uni-bielefeld.de
2 Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach, Germany tobias.rodemann@honda-ri.de
3 Frankfurt Institute for Advanced Studies, Ruth-Moufang-Strasse 1, 60438 Frankfurt, Germany weisswange@fias.uni-frankfurt.de
* These authors contributed equally
The approach was developed to model the development of
optimal cue integration in infants [6], [7] without making
assumptions about the characteristics of the cues. We test
this approach in two different robotics tasks. The first is
auditory depth estimation using stereo recordings from a
humanoid robot, the second one is a visual depth estimation
task in a stereo camera setup with vergence. Such depth
estimation is a basic prerequisite for important behaviors like
navigation, grasping or verbal interaction. In both sensory
domains this task is challenging if only standard sensors (i.e.
two microphones or two cameras) are available.
Both systems have been described in previous papers
[8], [9] and will only be explained briefly here. The cue
integration method was outlined in detail in [6], [7].
For both applications learning is done offline in a standard
training session using only part of the recorded sensory
data. The reward signal used to adapt the neural network
is based on the accuracy of its response to an input. Since
we have labeled data, this accuracy simply depends on the
difference between estimated and true depth, but in general
could relate to a behavioral outcome, e.g. the success of a
grasping movement. The weights are updated using a gradient
descent method. After training, performing cue integration
can be done with minimal computational effort.
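The training scheme described above can be sketched as reward-modulated gradient descent: the reward is the negative squared depth error, so ascending the reward gradient is equivalent to descending the error gradient. The sketch below assumes a simple linear integration layer and toy data; the actual network architecture and update rule are those of [6], [7], and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, z_true, lr=0.005, epochs=100):
    """Reward-based training of a linear cue-integration layer (sketch).

    X: per-sample cue feature vectors; z_true: labelled true depths.
    Reward = -(estimate - truth)^2, so gradient ascent on the reward
    is gradient descent on the squared depth error.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, z in zip(X, z_true):
            z_hat = w @ x                  # fused depth estimate
            # d(reward)/dw = -2 * (z_hat - z) * x  -> ascend it
            w += lr * (-2.0 * (z_hat - z) * x)
    return w

# Toy data: one unbiased noisy cue and one biased, noisier cue
z = rng.uniform(1.0, 5.0, size=200)
X = np.stack([z + rng.normal(0, 0.2, 200),
              0.8 * z + rng.normal(0, 0.4, 200)], axis=1)
w = train(X, z)
```

Because the update only needs the scalar reward and the current input, the same rule applies unchanged when the reward comes from a behavioral outcome such as grasping success, with no labeled depths required.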
For both sensory domains our results show a substantial
reduction in mean and maximum depth estimation error
compared with that of the best individual cue. This was achieved
even though the quality of the individual cues varied heavily
across the input space, the cues were strongly correlated,
and they showed significant biases. For these reasons, the new
method was also able to outperform standard weighted cue averaging.
II. AUDIO DEPTH ESTIMATION
Estimating the depth of a sound source is notoriously
difficult, especially when only one or two microphones are
available. If no triangulation is possible (either by moving
the robot or by using several pairs of microphones) no
direct, unambiguous cue to depth is available. In a previ-
ously described system [8] we therefore used a combination
of many different depth cues (outlined below) that were
computed and averaged over a complete sound segment.
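The per-segment averaging of cue evidence can be sketched as follows, assuming each audio cue yields one evidence vector over the 9 discrete depths per frame; the cue computations themselves are those of [8], and the frame values below are toy data.

```python
import numpy as np

N_DEPTHS = 9  # discrete depth hypotheses, as in the audio system

def segment_evidence(frame_evidence):
    """Average one cue's per-frame depth evidence over a sound segment.

    frame_evidence: array of shape (n_frames, N_DEPTHS), one row of
    non-negative evidence values per audio frame. Returns a single
    normalized evidence vector for the whole segment.
    """
    e = np.asarray(frame_evidence, dtype=float).mean(axis=0)
    return e / e.sum()    # normalize so the evidence sums to 1

# Three toy frames of one cue, all favouring depth bin 2
frames = [[0, 1, 5, 1, 0, 0, 0, 0, 0],
          [0, 2, 4, 2, 0, 0, 0, 0, 0],
          [0, 1, 6, 1, 0, 0, 0, 0, 0]]
D_i = segment_evidence(frames)
```

Averaging over the whole segment suppresses frame-level fluctuations before the per-cue evidence vectors are handed to the integration stage.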
In this framework sounds are considered as (proto) objects
with a set of attached audio features. Each of these audio
features i is mapped to a depth estimation D_audio^i(z), where
D_audio^i(z) is the evidence for one of 9 different depths (z)
based on the current values of cue i. The mapping from
The 15th International Conference on Advanced Robotics, Tallinn University of Technology, Tallinn, Estonia, June 20-23, 2011. ©2011 IEEE