Reward-based learning of optimal cue integration in audio and visual
depth estimation
Cem Karaoguz 1,2,*, Thomas H. Weisswange 3,*, Tobias Rodemann 2,*, Britta Wrede 1 and Constantin A. Rothkopf 3
Abstract— Many real-world applications in robotics have to
deal with imprecisions and noise when using only a single
information source for computation. Therefore, making use
of additional cues or sensors is often the method of choice.
One example considered in this paper is depth estimation,
where multiple visual and auditory cues can be combined to
increase precision and robustness of the final estimates. Rather
than using a weighted average of the individual estimates, we
use a reward-based learning scheme to adapt to the given
relations amongst the cues. This approach has been shown
before to mimic the development of near-optimal cue integration
in infants and benefits from using few assumptions about
the distribution of inputs. We demonstrate that this approach
can substantially improve performance in two different depth
estimation systems, one auditory and one visual.
I. INTRODUCTION
The combination of different cues to improve the per-
formance in tasks like segmentation [1], [2], object iden-
tification [3], or object tracking [4] is a common method
in robotics. Merging different cues with complementary or
partially redundant characteristics has a good potential to
improve both precision and robustness (for example, reducing
the mean and maximum estimation error). The optimal integration
is theoretically well defined and straightforward in a
Bayesian framework. However, for real world applications,
performing this computation is usually intractable. A com-
monly used approximation is a weighted sum of the maxi-
mum likelihood estimates of the different cues [5]. Such an
approach is computationally efficient but is only guaranteed
to be close to the optimal solution for the idealized case
of cues with uncorrelated Gaussian noise and knowledge of
error variances and potential biases. Unfortunately, for many
practical applications these conditions are not necessarily
met. Additionally, fixed weights can lead to problems when
the environment changes. An adaptive approach has been
presented in [4], where an optimal weighting of different
cues in an object tracking task is learned online.
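The weighted-sum approximation above has a simple closed form under the idealized assumption of unbiased cues with uncorrelated Gaussian noise: each maximum likelihood estimate is weighted by its inverse error variance. A minimal sketch (the cue values and variances below are illustrative, not taken from the paper):

```python
import numpy as np

def fuse_ml_estimates(estimates, variances):
    """Combine per-cue ML estimates by inverse-variance weighting.

    Optimal only for unbiased cues with uncorrelated Gaussian noise.
    Returns the fused estimate and its (reduced) variance.
    """
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()   # normalize to sum to 1
    fused = float(np.dot(weights, estimates))
    fused_var = 1.0 / precisions.sum()        # always <= min variance
    return fused, fused_var

# Two depth cues: 2.0 m (variance 0.04) and 2.6 m (variance 0.16)
depth, var = fuse_ml_estimates([2.0, 2.6], [0.04, 0.16])
```

Note that the fused variance is always smaller than the best single-cue variance; the approximation degrades exactly when the conditions named above (uncorrelated noise, known variances, no biases) are violated.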
In this work we present a different approach that uses a
general reward-based learning scheme for training a neural
network to combine depth estimations from multiple cues.
1 Research Institute for Cognition and Robotics (CoR-Lab), Bielefeld University, 33594 Bielefeld, Germany ckaraogu@cor-lab.uni-bielefeld.de
2 Honda Research Institute Europe GmbH, Carl-Legien-Str. 30, 63073 Offenbach, Germany tobias.rodemann@honda-ri.de
3 Frankfurt Institute for Advanced Studies, Ruth-Moufang-Strasse 1, 60438 Frankfurt, Germany weisswange@fias.uni-frankfurt.de
* These authors contributed equally
The approach was developed to model the development of
optimal cue integration in infants [6], [7] without making
assumptions about the characteristics of the cues. We test
this approach in two different robotics tasks. The first is
auditory depth estimation using stereo recordings from a
humanoid robot, the second one is a visual depth estimation
task in a stereo camera setup with vergence. Such depth
estimation is a basic prerequisite for important behaviors like
navigation, grasping or verbal interaction. In both sensory
domains this task is challenging if only standard sensors (i.e.
two microphones or two cameras) are available.
Both systems have been described in previous papers
[8], [9] and will only be explained briefly here. The cue
integration method was outlined in detail in [6], [7].
For both applications learning is done offline in a standard
training session using only part of the recorded sensory
data. The reward signal used to adapt the neural network
is based on the accuracy of its response to an input. Since
we have labeled data, this accuracy simply depends on the
difference between estimated and true depth, but in general
could relate to a behavioral outcome, e.g. the success of a
grasping movement. The weights are updated using a gradient
descent method. After training, performing cue integration
can be done with minimal computational effort.
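The training scheme described above can be sketched as reward-modulated gradient descent: the reward is the negative squared depth error, so ascending the reward gradient is equivalent to descending the error gradient. The sketch below assumes a simple linear integration layer and toy data; the actual network architecture and update rule are those of [6], [7], and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, z_true, lr=0.005, epochs=100):
    """Reward-based training of a linear cue-integration layer (sketch).

    X: per-sample cue feature vectors; z_true: labelled true depths.
    Reward = -(estimate - truth)^2, so gradient ascent on the reward
    is gradient descent on the squared depth error.
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, z in zip(X, z_true):
            z_hat = w @ x                  # fused depth estimate
            # d(reward)/dw = -2 * (z_hat - z) * x  -> ascend it
            w += lr * (-2.0 * (z_hat - z) * x)
    return w

# Toy data: one unbiased noisy cue and one biased, noisier cue
z = rng.uniform(1.0, 5.0, size=200)
X = np.stack([z + rng.normal(0, 0.2, 200),
              0.8 * z + rng.normal(0, 0.4, 200)], axis=1)
w = train(X, z)
```

Because the update only needs the scalar reward and the current input, the same rule applies unchanged when the reward comes from a behavioral outcome such as grasping success, with no labeled depths required.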
For both sensory domains our results show a substantial
reduction in mean and maximum depth estimation error
compared with that of the best individual cue. This was achieved
even though the quality of the individual cues varied heavily
across the input space, the cues were strongly correlated,
and they showed significant biases. For these reasons, the new
method was also able to outperform standard weighted cue averaging.
II. AUDIO DEPTH ESTIMATION
Estimating the depth of a sound source is notoriously
difficult, especially when only one or two microphones are
available. If no triangulation is possible (either by moving
the robot or by using several pairs of microphones) no
direct, unambiguous cue to depth is available. In a previ-
ously described system [8] we therefore used a combination
of many different depth cues (outlined below) that were
computed and averaged over a complete sound segment.
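The per-segment averaging of cue evidence can be sketched as follows, assuming each audio cue yields one evidence vector over the 9 discrete depths per frame; the cue computations themselves are those of [8], and the frame values below are toy data.

```python
import numpy as np

N_DEPTHS = 9  # discrete depth hypotheses, as in the audio system

def segment_evidence(frame_evidence):
    """Average one cue's per-frame depth evidence over a sound segment.

    frame_evidence: array of shape (n_frames, N_DEPTHS), one row of
    non-negative evidence values per audio frame. Returns a single
    normalized evidence vector for the whole segment.
    """
    e = np.asarray(frame_evidence, dtype=float).mean(axis=0)
    return e / e.sum()    # normalize so the evidence sums to 1

# Three toy frames of one cue, all favouring depth bin 2
frames = [[0, 1, 5, 1, 0, 0, 0, 0, 0],
          [0, 2, 4, 2, 0, 0, 0, 0, 0],
          [0, 1, 6, 1, 0, 0, 0, 0, 0]]
D_i = segment_evidence(frames)
```

Averaging over the whole segment suppresses frame-level fluctuations before the per-cue evidence vectors are handed to the integration stage.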
In this framework sounds are considered as (proto) objects
with a set of attached audio features. Each of these audio
features i is mapped to a depth estimation D_audio^i(z), where
D_audio^i(z) is the evidence for one of 9 different depths (z)
based on the current values of cue i. The mapping from
The 15th International Conference on Advanced Robotics, Tallinn University of Technology, Tallinn, Estonia, June 20-23, 2011. ©2011 IEEE