Achieving Domain Robustness in Stereo Matching Networks by Removing Shortcut Learning

WeiQin Chuah, Ruwan Tennakoon, and Alireza Bab-Hadiashar
RMIT University, Australia
{wei.qin.chuah,ruwan.tennakoon,abh}@rmit.edu.au

David Suter
Edith Cowan University (ECU), Australia
d.suter@ecu.edu.au

Abstract

Learning-based stereo matching and depth estimation networks currently excel on public benchmarks with impressive results. However, state-of-the-art networks often fail to generalize from synthetic imagery to more challenging real data domains. This paper attempts to uncover the hidden secrets of achieving domain robustness and, in particular, to discover the important ingredients of generalization success in stereo matching networks by analyzing the effect of synthetic image learning on real data performance. We provide evidence that the features a stereo matching network learns in the synthetic domain are heavily influenced by two "shortcuts" present in the synthetic data: (1) identical local statistics (RGB colour features) between matching pixels in the synthetic stereo images and (2) lack of realism in synthetic textures on 3D objects simulated in game engines. We show that by removing such shortcuts, we can achieve domain robustness in state-of-the-art stereo matching frameworks and produce remarkable performance on multiple realistic datasets, despite the fact that the networks were trained on synthetic data only. Our experimental results indicate that eliminating shortcuts from the synthetic data is key to achieving domain-invariant generalization between synthetic and real data domains.

1. Introduction

Stereo matching is a fundamental problem in computer vision and is widely used in applications such as augmented reality (AR), robotics and autonomous driving. Stereo matching aims to estimate depth by computing the horizontal displacement of pixel correspondences between a pair of stereo images.
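
The relation between horizontal displacement (disparity) and depth can be illustrated with a minimal sketch. It assumes a rectified stereo pair and the standard pinhole relation Z = f * B / d, which the text above implies but does not state. The function names and the numeric values (pixel coordinates, focal length, baseline) are hypothetical and chosen only for illustration.

```python
from typing import Optional


def disparity_from_match(x_left: float, x_right: float) -> float:
    """Horizontal displacement (pixels) of one correspondence in a rectified pair."""
    return x_left - x_right


def depth_from_disparity(
    disparity_px: float, focal_px: float, baseline_m: float
) -> Optional[float]:
    """Depth in metres via Z = f * B / d (assumed rectified pinhole model).

    Returns None for non-positive disparity, where no finite depth is defined.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


# Hypothetical values: a pixel at column 420 in the left image matches
# column 400 in the right image; focal length 700 px, baseline 0.5 m.
d = disparity_from_match(x_left=420.0, x_right=400.0)
z = depth_from_disparity(d, focal_px=700.0, baseline_m=0.5)
print(d, z)  # 20.0 17.5
```

Because depth is inversely proportional to disparity, a fixed error in disparity produces a larger depth error for distant points than for nearby ones.
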
In recent years, many end-to-end Convolutional Neural Networks (CNNs) have been developed to perform stereo matching and have achieved outstanding results on several publicly available datasets and benchmarks [6, 11, 16, 42, 49]. In practice, state-of-the-art stereo matching networks are trained in a supervised fashion, where annotated datasets are required to fine-tune the models from synthetic to real data domains. However, ground-truth disparity labels are cumbersome to generate in real-world scenarios.

[Figure 1 panels: (a) Left Image; (b) Baseline: EPE=1.19, EPE=2.42; (c) Ours: EPE=1.14, EPE=1.26]
Figure 1: (Best viewed in color and zoomed in) Performance comparison between stereo matching networks without (baseline) and with (ours) the shortcuts removed. The performance of the baseline network deteriorates when adversarial noise that is hardly visible to the human eye is added to the stereo images (bottom).

arXiv:2106.08486v1 [cs.CV] 15 Jun 2021

A major drawback of existing learning-based stereo matching networks is their inability to generalize to unseen domains. It is commonly understood that this is due to domain differences between the training and testing data [39]. The differences may include discrepancies in image appearance, style and content between datasets. To overcome this, unsupervised domain adaptation (UDA) methods have been proposed to bridge the domain gaps between synthetic and real data, and to effectively transfer learned