6D-Vision: Fusion of Stereo and Motion for Robust Environment Perception

Uwe Franke, Clemens Rabe, Hernán Badino, and Stefan Gehrig

DaimlerChrysler AG, 70546 Stuttgart, Germany
{uwe.franke,clemens.rabe,hernan.badino,stefan.gehrig}@daimlerchrysler.com

Abstract. Obstacle avoidance is one of the most important challenges for mobile robots as well as for future vision-based driver assistance systems. This task requires the precise extraction of depth and the robust and fast detection of moving objects. To reach these goals, this paper considers vision as a process in space and time. It presents a powerful fusion of depth and motion information for image sequences taken from a moving observer. The 3D position and 3D motion of a large number of image points are estimated simultaneously by means of Kalman filters. No prior, error-prone segmentation is needed. The result is a rich 6D representation that allows the detection of moving obstacles even in the presence of partial occlusion of foreground or background.

1 Introduction

Moving objects are the most dangerous objects in many applications. The fast and reliable estimation of their motion is a major challenge for the environment perception of mobile systems, and of driver assistance systems in particular. The three-dimensional information delivered by stereo vision is commonly accumulated in an evidence-grid-like structure [10]. Since stereo does not reveal any motion information, the depth map is usually segmented, and the detected objects are tracked over time in order to obtain their motion. The major disadvantage of this standard approach is that the detection performance depends heavily on the correctness of the segmentation. In particular, moving objects in front of stationary ones (e.g. the bicycle in front of the parked vehicles shown in figure 1) are often merged with them and therefore not detected. This can cause dangerous misinterpretations and calls for more powerful solutions.
Our first attempt to overcome this problem was the so-called flow-depth constraint [7]. Heinrich compared the measured optical flow with the expectation stemming from the known ego-motion and the 3D stereo information. Independently moving objects do not fulfil the constraint and can easily be detected. Unfortunately, this approach turned out to be very sensitive to small errors in the ego-motion estimation, since only two consecutive frames are considered.

Humans do not have the above-mentioned problems, since we simultaneously evaluate depth and motion in the retinal images and integrate the observations over time [11]. The approach presented in this paper follows this principle. The