IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. PP, NO. , XX 2017

Fundamental Principles on Learning New Features for Effective Dense Matching

Feihu Zhang, Student Member, IEEE, Benjamin W. Wah, Fellow, IEEE

Abstract—In dense matching (including stereo matching and optical flow), nearly all existing approaches rely on simple features, such as gray or RGB color, gradient, or simple transformations like census, to calculate matching costs. These features do not perform well in complex scenes that involve radiometric changes, noise, overexposure and/or textureless regions. Various problems may appear, such as wrong matches at the pixel or region level, flattening or breaking of edges, or even entire structural collapse. In this paper, we propose two fundamental principles based on the consistency and the distinctiveness of features. We show that almost all existing problems in dense matching are caused by features that violate one or both of these principles. To systematically learn good features for dense matching, we develop a general multi-objective optimization based on these two principles and apply convolutional neural networks (CNNs) to find new features that lie on the Pareto frontier. Using two-frame optical flow and stereo matching as applications, our experimental results show that the learned features can significantly improve the performance of state-of-the-art approaches. On the KITTI benchmarks, our method ranks first on the two stereo benchmarks and is the best among existing two-frame optical-flow algorithms on the flow benchmarks.

Index Terms—Image Feature, CNN, Dense Matching, Optical Flow, Stereo Matching, Matching Cost.

I. INTRODUCTION

Stereo matching, optical flow and other dense-matching applications have long been central problems in computer vision. In the past, a number of methods have been developed to solve these problems.
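To make the "simple features" discussed above concrete, the following is a minimal sketch (not the paper's learned features) of a census-transform matching cost of the kind cited in the abstract: each pixel is encoded by comparing it with its neighbors, and two views are compared by Hamming distance. The 3x3 window, the `np.roll` disparity shift, and the function names are illustrative assumptions.

```python
import numpy as np

def census_transform(img, r=1):
    """Encode each pixel as a bit string: one bit per neighbor in the
    (2r+1)x(2r+1) window, set when the neighbor is darker than the center.
    Invariant to any monotonically increasing intensity change."""
    h, w = img.shape
    pad = np.pad(img, r, mode='edge')
    code = np.zeros((h, w), dtype=np.uint64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = pad[r + dy:r + dy + h, r + dx:r + dx + w]
            code = (code << np.uint64(1)) | (neighbor < img).astype(np.uint64)
    return code

def matching_cost(left, right, d):
    """Per-pixel stereo matching cost at disparity d: Hamming distance
    between census codes of the left view and the shifted right view.
    (np.roll wraps around at the border; a real implementation would crop.)"""
    cl = census_transform(left)
    cr = census_transform(np.roll(right, d, axis=1))
    xor = cl ^ cr
    # popcount of the XOR gives the number of differing census bits
    return np.vectorize(lambda v: bin(int(v)).count('1'))(xor)
```

The census code's robustness to radiometric change (it depends only on intensity *order*, not magnitude) is exactly why it outperforms raw color under exposure shifts, while its small window still leaves it weak in textureless regions, the failure mode the paper targets.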
These methods consist of three steps: extracting features and their descriptors, computing the matching cost and/or aggregation [1]–[4], and applying matching algorithms [5]–[7] to minimize some energy functions. In recent years, much attention has been paid to the last two steps. However, little has been done on feature extraction, which is critical in dense matching.

The most popular features for stereo matching and optical flow are still limited to some kind of color space or gradient values [8]. Although these simple features are fast to compute and flexible (e.g., in scaling and subpixel interpolation), they are easily influenced by radiometric changes, noise, overexposure and the scene environment (as shown in Fig. 1). This is also the major reason why some of the best methods that work well on benchmarks of simple indoor scenes [9] report limited success on benchmarks of complicated outdoor scenes [10]. On the other hand, the popular sparse features used for shape matching and object detection (including SIFT [11] and SURF [12]), which are scale- and radiometric-invariant, have met limitations in performance and flexibility when directly applied to dense matching. These methods were not designed for dense matching from the beginning, and they usually involve a complex step of label densification [13].

(The authors are with the Chinese University of Hong Kong (e-mail: hi.yexu@gmail.com, bwah@cuhk.edu.hk). Research was supported in part by the National Grand Fundamental Research 973 Program of China No. 2014CB340401. Manuscript received XX 2016; revised XX 2017; accepted XX, 2017.)

Instead of developing new features for dense matching, some recent methods [14], [15] introduce convolutional neural networks (CNNs) to compare the similarity of a pair of patches and use the similarity score as the matching cost. These help achieve high accuracy when used in some stereo-matching methods [14]. To address their high computational cost, Zbontar et al.
proposed a faster framework with some sacrifice in accuracy [16]. However, as its time complexity depends on the displacement space, it still cannot be used for optical flow or other complex algorithms, such as continuous matching with slanted surfaces or subpixel accuracy. (The time complexity is O(KNM), where K is the size of the displacement space, N is the number of pixels, and M is the computational cost of one CNN evaluation.)

The primary problem in developing better dense-matching algorithms is to identify good features. There has been little work in this area. A direct approach [8], [17] collects the error rates obtained when employing one type of feature in a specific algorithm. These rates, however, are not useful for designing feature extractors because they provide no quantitative information on the features of each pixel and/or region. Moreover, it is impractical to use them as optimization targets: not only is it time-consuming to run matching algorithms during feature extraction, but the error rate of one matching algorithm cannot represent a feature's performance in other algorithms.

In this paper, we identify two fundamental principles on good features that each pixel should possess in order to be effective for dense matching. These principles help in understanding the requirements on good features for dense matching, as well as in identifying the weaknesses of existing algorithms (which violate one or both of these principles). The first principle, the Consistency Principle, states that a feature point (a pixel/location where the feature performs well) should have the same or similar feature descriptors (such as RGB values when color is used as the feature) when it appears in different image views (such as the left and right views of a stereo pair). For example, many existing features like color or gradient are highly influenced by noise, radiometric variance, scaling changes, translation and/or rotation. As illustrated in Fig.
1, such external disturbances can easily break the consistency of features between different views. The second principle, the Distinctiveness Principle, states that a feature point should be different enough with respect