FAST VARIABLE CENTER-BIASED WINDOWING FOR HIGH-SPEED STEREO ON PROGRAMMABLE GRAPHICS HARDWARE

Jiangbo Lu ∗,† , Gauthier Lafruit † , and Francky Catthoor ∗,†

∗ Department of Electrical Engineering, University of Leuven, Belgium
† Multimedia Group, IMEC, Kapeldreef 75, B-3001, Leuven, Belgium

ABSTRACT

We present a high-speed dense stereo algorithm that achieves both good-quality results and very high disparity estimation throughput on the graphics processing unit (GPU). The key idea is a variable center-biased windowing approach, which enables an adaptive selection of the most suitable support patterns with varying sizes and shapes. As the fundamental construct for variable windows, a truncated separable Laplacian kernel approximation is proposed for efficient pixel-wise weighted cost aggregation. We also present a number of critical optimization schemes to boost the real-time speed on GPUs. Our method outperforms previous GPU-based local stereo methods, and even some methods using global optimization, on the Middlebury stereo database. Our optimized implementation, running entirely on an Nvidia GeForce 7900 graphics card, achieves over 605 million disparity estimations per second (Mde/s) including all overhead, about 2.1 to 12.1 times faster than existing GPU-based solutions.

Index Terms— Stereo vision, real-time dense stereo, GPGPU

1. INTRODUCTION

Depth from stereo is an important computer vision topic that has attracted intensive research interest for decades. A substantial amount of work has been done on stereo, which is systematically surveyed and evaluated by Scharstein and Szeliski [1]. Casting stereo as a global optimization problem usually leads to high-quality disparity estimates, but most of these global techniques are too computationally expensive for online processing. Real-time stereo applications today still largely rely on local methods together with a winner-takes-all (WTA) decision strategy.
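To make the "local method plus WTA" baseline mentioned above concrete, the following is a minimal sketch of it in plain Python. It is a generic illustration and not the algorithm proposed in this paper: it uses a fixed square box window and a plain absolute-difference cost, and the names `make_shifted_pair`, `wta_disparity`, `max_disp`, and `radius` are hypothetical. It assumes grayscale images stored as lists of rows, and that a left-image pixel at column x matches the right-image pixel at column x - d.

```python
def make_shifted_pair(height, width, shift):
    """Build a synthetic pair where the left image is the right image shifted by `shift` columns."""
    right = [[x * x + 3 * y for x in range(width)] for y in range(height)]
    left = [[right[y][x - shift] if x >= shift else 0 for x in range(width)]
            for y in range(height)]
    return left, right


def wta_disparity(left, right, max_disp, radius=1):
    """Box-window absolute-difference matching followed by winner-takes-all selection."""
    h, w = len(left), len(left[0])
    disparity = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_cost, best_d = float("inf"), 0
            for d in range(max_disp + 1):
                cost, count = 0.0, 0
                # Aggregate the per-pixel cost over a (2*radius+1)^2 box window,
                # skipping support pixels whose match falls outside either image.
                for dy in range(-radius, radius + 1):
                    for dx in range(-radius, radius + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx - d and xx < w:
                            cost += abs(left[yy][xx] - right[yy][xx - d])
                            count += 1
                if count == 0:
                    continue  # no valid support pixel for this disparity hypothesis
                cost /= count  # normalize so clipped windows stay comparable
                # WTA: keep the disparity hypothesis with the smallest aggregated cost.
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y][x] = best_d
    return disparity
```

Calling `wta_disparity(*make_shifted_pair(6, 12, 2), max_disp=4)` recovers the synthetic shift away from the left image border. The box window weights every support pixel equally, which is precisely the limitation that adaptive and center-biased support windows are meant to address.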
Typically, local window-based approaches aggregate the matching cost over a given support window to increase the robustness to noise and texture variation. However, to obtain accurate results at depth discontinuities as well as in homogeneous regions, an appropriate support window should be decided adaptively for each pixel. To this end, several local methods have been proposed. For instance, Fusiello et al. [2] performed the correlation with nine windows anchored at different points and retained the disparity with the smallest matching cost. However, this method and its generalization, i.e., shiftable windows [1], usually require a relatively large number of candidate support windows to achieve good estimation results; moreover, their box filters cannot adequately differentiate the impact of support pixels at different spatial locations. Recently, Yoon and Kweon [3] proposed a state-of-the-art local window method, albeit at a very demanding computational cost, in which pixel-wise support weights are defined using a Laplacian kernel to model the grouping strength of each support pixel.

Nonetheless, resorting solely to local methods is not a cure-all for achieving dense stereo at high video rates. In fact, only recently have software-only real-time stereo systems begun to emerge. These systems exploit assembly-level instruction optimization using Intel's MMX extension, but leave few CPU cycles for other tasks, including high-level interpretation of the stereo results. Harnessing some powerful built-in features of the modern graphics processing unit (GPU), Yang et al. first proposed a pyramid-shaped correlation kernel [4] and small-scale adaptive support windows [5]. Though very impressive disparity estimation throughput is obtained on GPUs, these techniques cannot strike an optimal quality balance between homogeneous and heterogeneous regions.
Later on, Gong and Yang [6] proposed an image-gradient-guided correlation method with improved accuracy, while still maintaining real-time speed on GPUs. Inspired by [3], Wang et al. [7] recently introduced an adaptive aggregation step in a dynamic-programming stereo framework. Their high-quality results are obtained through complicated cost aggregation and a global optimization strategy, and real-time speed is enabled by exploiting the unique processing capabilities of both the CPU and the GPU.

This paper presents a novel stereo algorithm that is specially designed to achieve competitive disparity quality and high-speed execution on GPUs. At the heart of the proposed algorithm is a variable center-biased windowing approach, which enables an adaptive selection of the most suitable support patterns for different regions. Our method is similar in spirit to the variable window approach [8], but it is much faster because it avoids the costly dynamic programming.

Concerning real-time speed, the proposed method is by far the fastest among all these GPU-based approaches. The major contributing factors are threefold: 1) our highly efficient core stereo processing, 2) a number of special implementation optimizations on the GPU, and 3) upgrading to more advanced graphics hardware. Running entirely on an Nvidia GeForce 7900 graphics card, our optimized implementation achieves over 605 million disparity estimations per second (Mde/s), compared to a maximum speed of 289 Mde/s in [5], 117 Mde/s in [6], and 50 Mde/s on CPU+GPU in [7].

2. THE PROPOSED STEREO MATCHING ALGORITHM

Following the taxonomy in [1], our stereo algorithm contains three major steps: matching cost computation, cost aggregation, and finally disparity selection. In the first step, a matching cost is computed for every possible disparity value of each pixel.
To suppress the influence of mismatches during the subsequent cost aggregation step, we adopt the truncated absolute difference (TAD) as the matching cost measure. Like most local approaches, the proposed algorithm places key emphasis on the cost aggregation step to reduce the ambiguity in matching, and we therefore focus on this core part for the remainder of this section. In the last step, disparity selection, a local WTA optimization is performed at each pixel, simply choosing the disparity associated with the minimum cost value. The entire framework of our stereo algorithm is illustrated in Fig. 1.

VI - 568 1-4244-1437-7/07/$20.00 ©2007 IEEE ICIP 2007

The proposed cost aggregation step is composed of two parts: 1)