I spy with my little eye: Learning Optimal Filters for Cross-Modal Stereo under Projected Patterns

Wei-Chen Chiu  Ulf Blanke  Mario Fritz
Max-Planck-Institute for Informatics, Saarbrücken, Germany
{walon, blanke, mfritz}@mpi-inf.mpg.de

Abstract

With the introduction of the Kinect as a gaming interface, its broad commercial availability and high-quality depth sensor have attracted attention not only from consumers but also from researchers in the robotics community. The active sensing technique of the Kinect produces robust depth maps for reliable human pose estimation. For a broader range of applications in robotic perception, however, its active sensing approach fails under many operating conditions, such as objects with specular or transparent surfaces.

Recently, an initial study has shown that some of the arising problems can be alleviated by complementing the active sensing scheme with passive, cross-modal stereo between the Kinect's RGB and IR cameras. However, the method is troubled by interference from the IR projector that is required for the active depth sensing. We investigate these issues, conduct a more detailed study of the physical characteristics of the sensors, and propose a more general method that learns optimal filters for cross-modal stereo under projected patterns. Our approach improves over the baseline in a point-cloud-based object segmentation task without modifications of the Kinect hardware and despite the interference from the projector.

1. Introduction

Despite the advance of elaborate global (e.g. [2]) and semi-global (e.g. [5]) stereo matching techniques, real-time stereo on standard hardware is still dominated by local methods based on patch comparisons. It is all the more surprising that we have seen very little work on improving the correspondences by a learning approach that would be better suited to a particular setting or conditions [10].
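To make the notion of local, patch-based matching concrete, the following is a minimal sketch of a sum-of-absolute-differences (SAD) block matcher of the kind such local methods build on. The patch size and disparity range are illustrative, not taken from the paper:

```python
import numpy as np

def patch_sad_disparity(left, right, patch=3, max_disp=8):
    """Minimal local stereo: for each candidate disparity d, shift the
    right image horizontally, take per-pixel absolute differences,
    aggregate them over a small patch, and keep the disparity with the
    lowest aggregated cost (winner-take-all)."""
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(max_disp + 1):
        diff = np.full((h, w), np.inf)
        if d == 0:
            diff = np.abs(left - right)
        else:
            diff[:, d:] = np.abs(left[:, d:] - right[:, :-d])
        # aggregate the per-pixel cost over a patch (simple box sum)
        k = patch
        agg = np.full((h, w), np.inf)
        for y in range(h - k + 1):
            for x in range(w - k + 1):
                agg[y + k // 2, x + k // 2] = diff[y:y + k, x:x + k].sum()
        cost[d] = agg
    return np.argmin(cost, axis=0)
```

On a synthetic pair where the right image is the left shifted by two pixels, the matcher recovers a disparity of 2 in the image interior; cross-modal IR/RGB pairs break exactly this step, because the raw intensity differences no longer reflect correspondence.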
Yet the variety of pre-filters that practitioners use to improve the matching process is clear evidence that there is room for improvement over basic patch-based differences.

Figure 1: (a) Response of RGB camera (left) and IR camera (right). (b) and (c) Image pair obtained by the Kinect with projected IR pattern. (d) Disparity map on unfiltered pairs. (e) Disparity map on patch-filtered image pairs.

In our case the need for learning is even more apparent, as we attempt cross-modal matching between the IR and RGB images obtained from the Kinect sensor. Such a system was recently proposed [1]; it augments the active sensing strategy of the Kinect with a passive stereo algorithm between the two available imagers. A very simple pixel-based re-weighting scheme was proposed that produces an IR-like image for improved depth estimates.
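A pixel-based re-weighting of this kind amounts to a per-pixel weighted sum of the color channels. The sketch below illustrates the idea; the weight values are placeholders chosen for illustration, not the coefficients learned in [1]:

```python
import numpy as np

def rgb_to_ir_like(rgb, weights=(0.2, 0.3, 0.5)):
    """Form a pseudo-IR intensity image as a per-pixel weighted sum of
    the R, G, B channels. In a learned scheme the weight vector would be
    fit so the output resembles the IR camera's response; the defaults
    here are illustrative placeholders."""
    w = np.asarray(weights, dtype=np.float64)
    # (H, W, 3) @ (3,) -> (H, W); normalize so intensities keep their range
    return (rgb.astype(np.float64) @ w) / w.sum()
```

Because the transform acts on each pixel independently, it cannot exploit spatial context, which is one motivation for learning filters over patches instead.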