DENSE URBAN DEM WITH THREE OR MORE HIGH-RESOLUTION AERIAL IMAGES

Uğur M. Leloğlu, Michel Roux, Henri Maître

TÜBİTAK-BİLTEN
METU, 06531 Ankara, Turkey
Ph.: +90 312 210 46 13, Fax: +90 312 210 13 15
E-mail: lel@tbtk.metu.edu.tr

Signal and Image Department, ENST
46 rue Barrault, 75013 Paris, France
Ph.: +33 1 45 81 81 28, Fax: +33 1 45 81 37 94
E-mail: mroux,maitre@ima.enst.fr

KEY WORDS: stereo, 3-D reconstruction, DEM, image matching, high-resolution aerial imagery

ABSTRACT

In cartographic applications, area-based matching techniques are commonly used for stereo matching of low-resolution aerial images. However, these techniques fail in matching high-resolution aerial images of urban areas because such images, compared to low-resolution aerial images, contain relatively frequent and sharp depth discontinuities and large occluded or textureless areas. This paper presents a hierarchical correlation-based matching technique which fuses information from multiple image pairs and employs a support-collection mechanism in the object space, as well as a relaxation algorithm, to resolve ambiguities, producing accurate and dense disparity maps. The disambiguating power of the algorithm and the use of multiple pairs allow us to use very small correlation windows, so that the computational complexity is kept low and the boundary overreach problem is avoided. They also make it unnecessary to apply any threshold to the correlation values; as a result, very dense disparity maps can be obtained.

1 INTRODUCTION

Obtaining digital elevation models (DEMs) from aerial images is useful for a number of cartographic applications. The use of correlation-based stereo in establishing DEMs is common and well studied. Such techniques are known to be successful in low-resolution images or in images of non-urban areas where the depth changes smoothly and where there exists rich texture.
In this paper, we address correlation-based stereo correspondence in the domain of high-resolution aerial images of urban areas, which typically contain large textureless regions (e.g. roads and, especially, roofs, which are of great importance), frequent sharp depth discontinuities, and large occlusions. The images on which we develop and test the method presented here are 24-bit RGB aerial images of West European industrial or urban zones with a ground resolution of 8 cm. The internal camera parameters are readily available, and the external parameters can easily be determined using standard calibration techniques.

This paper is organised as follows: In the following section, some related work is summarised. In section 3, a hierarchical relaxation algorithm, which calculates the disparity maps from multiple image pairs simultaneously, is described. In section 4, some experimental results are presented and, finally, the paper is concluded with a discussion of the results in section 5.

2 RELATED WORK

One way to use more than two images in stereo reconstruction is to construct epipolar image pairs, obtain disparity maps with conventional stereo techniques, and then merge the resulting matches in the object space. There is a rich literature on the integration of depth data from multiple sources, not necessarily from stereo but also from shape-from-shading (Ferrie and Levine, 1987) or range images (Shum et al., 1994; Higuchi et al., 1993). An interesting approach to merging disparity maps resulting from multiple stereo pairs is that of (Fua, 1997), where small patches ("oriented particles") are fitted to matches in 3-D object space to estimate the underlying surface.

Footnote: Supported by TÜBİTAK (The Scientific and Technical Research Council of Turkey) and by IMPACT: Esprit project 20243.

Another way of using three or more images is to employ a correlation-like similarity measure defined over all images involved.
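As a rough illustration of a similarity measure defined over several images at once, the sketch below sums squared differences (SSD) between a window in a reference scanline and the corresponding windows in the other scanlines, for each candidate disparity. It is not the method of this paper or of any cited work. It assumes rectified 1-D scanlines and cameras on a common baseline, so that the shift in each image is the candidate disparity scaled by that image's baseline ratio. The function name `multi_image_ssd` and all parameter names are hypothetical.

```python
def multi_image_ssd(ref, others, ratios, x, half, max_disp):
    """Return the SSD summed over all images for each candidate disparity.

    ref      : reference scanline (list of intensities)
    others   : scanlines from the other images (assumed rectified)
    ratios   : baseline of each other image relative to a unit baseline
    x        : pixel position in the reference scanline
    half     : half-width of the matching window
    max_disp : largest candidate disparity (for the unit baseline)
    """
    scores = []
    for d in range(max_disp + 1):
        total = 0.0
        for line, ratio in zip(others, ratios):
            # Assumption: disparity scales linearly with the baseline ratio.
            shift = round(d * ratio)
            for k in range(-half, half + 1):
                xr, xo = x + k, x + k - shift
                if 0 <= xr < len(ref) and 0 <= xo < len(line):
                    total += (ref[xr] - line[xo]) ** 2
                else:
                    # Window leaves an image: reject this candidate.
                    total += float("inf")
        scores.append(total)
    return scores


# Toy data: the same intensity pattern shifted by 2 and by 4 pixels.
ref = [0, 0, 0, 0, 0, 9, 5, 2, 0, 0, 0, 0]
near = ref[2:] + [0, 0]        # unit baseline, true disparity 2
far = ref[4:] + [0, 0, 0, 0]   # double baseline, true disparity 4
scores = multi_image_ssd(ref, [near, far], [1, 2], x=6, half=1, max_disp=3)
best_disparity = min(range(len(scores)), key=scores.__getitem__)
```

Summing over images before picking the minimum means that a candidate has to agree with every image, which is what lets the combined curve be less ambiguous than the curve from any single pair.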
In multibaseline stereo, the pixels in each image that correspond to a given pixel in the reference image and a given depth can easily be found. The sum of squared differences (SSD) within a window around those pixels, which was first used by (Okutomi and Kanade, 1993) in this context, can be plotted as a function of depth. (Kang and Szeliski, 1997) use SSD in panoramic images; (Park and Inoue, 1997) use only the two median values of the four differences obtained from five cameras to overcome the problem of occlusion; (Scharstein and Szeliski, 1996) use an adaptive support region instead of a square window; and (Canu et al., 1995) use a sum of normalised correlations instead of SSD.

A third way is to project all possible matches from multiple pairs to 3-D and then to choose the true matches in object space. (Zitnick and Webb, 1996) project matches from multiple cameras, with respect to a reference camera, to 3-D and eliminate some of the false matches by tracking each match, in all pairs, in increasing order of baseline distance. Consequently, a point can be matched only when it can be seen from all cameras. The remaining 3-D points are grouped into continuous surfaces, considering their depth differences in 3-D and their pixel distance in 2-D. The most numerous groups are assumed to correspond to true surfaces.

3 DESCRIPTION OF THE ALGORITHM

When disparity maps from multiple stereo image pairs are merged, one does not benefit from the information in three or more images during the matching process, even though some false matches could be eliminated, or more matches obtained, in that early phase. Correlation-like measures defined on three or more images are more powerful in that sense; however, a match which is very clear in one pair of images (i.e., a very sharp and large peak in the correlation signal) can be lost because of noise, occlusion or high disparity gradient. Besides, when all cameras are