International Journal of Computer Vision, 9:1, 31-53 (1992)
© 1992 Kluwer Academic Publishers. Manufactured in The Netherlands.

Dynamic Integration of Height Maps into a 3D World Representation from Range Image Sequences

MINORU ASADA, MASAHIRO KIMURA, YASUHIRO TANIGUCHI, AND YOSHIAKI SHIRAI
Mechanical Engineering for Computer-Controlled Machinery, Osaka University, Suita, Osaka 565, Japan

Received

Abstract

Integration of 2½D sketches obtained at different observation stations into a consistent world (or object) representation is one of the central issues in computer vision and robotics. The resolution and accuracy of 2½D sketches may differ from one viewpoint to another, and inconsistent data between different observations may occur. This article presents an approach to building a spatiotemporal representation of dynamic scenes, including moving objects, from a sequence of range images taken by a moving observer. A range image is transformed into a height-map representation, which is segmented into the ground plane and the objects on it. In order to capture the resolution and accuracy of the range information and the consistency of the height information between different height maps, we define a reliability measure of the height information for each bucket on the height map. Using this reliability, the system finds the correspondences of both static and moving objects between different observations, and successively refines the height information and its reliability with newly acquired data, dealing with inconsistent data. The final representation of the integrated height map consists of the time stamp of the last observation, region labels of static and moving objects, and their spatiotemporal properties such as height information, its reliability, and the velocities of both the observer and independently moving objects. We applied the method to road scenes physically simulated by landscape toy models and show the experimental results.
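The refinement step summarized above, in which each height-map bucket is updated with newly acquired data according to its reliability, can be sketched as a reliability-weighted average. The following is a minimal illustrative sketch, not the paper's exact formulation; the names (`Bucket`, `fuse_bucket`) and the specific weighting rule are assumptions for illustration:

```python
# Reliability-weighted fusion of a new height measurement into one
# height-map bucket. Illustrative sketch only: the paper's actual
# reliability definition and inconsistency handling are defined later.

from dataclasses import dataclass


@dataclass
class Bucket:
    height: float       # current fused height estimate for this bucket
    reliability: float  # accumulated reliability (larger = more trusted)


def fuse_bucket(b: Bucket, h_new: float, r_new: float) -> Bucket:
    """Fuse a new observation (h_new, r_new) into bucket b,
    weighting each height estimate by its reliability."""
    r_total = b.reliability + r_new
    h_fused = (b.reliability * b.height + r_new * h_new) / r_total
    return Bucket(h_fused, r_total)


# Example: a weakly reliable old estimate refined by a more reliable one.
old = Bucket(height=1.0, reliability=1.0)
new = fuse_bucket(old, h_new=2.0, r_new=3.0)
print(new.height)       # 1.75
print(new.reliability)  # 4.0
```

Under this scheme a bucket observed many times, or observed at close range, accumulates reliability, so a single inconsistent reading perturbs it only slightly.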
1 Introduction

Representing a scene in terms of 3D object models and their spatial relationships is one of the central issues in computer vision. So far, extensive studies have been dedicated to the process of extracting scene features (2½D sketches) from 2D imagery [1], and little attention has been paid to the integration of 2½D sketches into a consistent world (or object) representation. Since the higher tasks of computer vision, such as interpreting a scene and spatial reasoning about it, depend heavily on reliable descriptions of the objects, integration of 2½D sketches obtained at different observation stations into a consistent world (or object) representation is a fundamental task in computer vision. Although a task of this kind seems straightforward when using 3D features, we have to solve the following problems in order to build a 3D world representation of dynamic scenes from a sequence of observations: (i) we have to find correspondences of independently moving objects as well as static ones between consecutive views taken by a moving observer, and (ii) we must integrate the sensor outputs of different sensor resolutions and accuracies into a consistent representation.

Kalman filter techniques have been proposed for improving the accuracy of location data of sparse features (feature points or line segments) and/or the motion parameters of the observer [2, 3]. The Kalman filter is a viable tool for real-time signal processing of dynamic low-level data because it incorporates a representation of uncertainty and provides an optimal estimate of the fused data in a statistical sense. In those cases, the matching process is straightforward because the feature points or line segments are sparse and, therefore, easily trackable. For dense data, Matthies et al. [4] proposed a Kalman filter-based algorithm for estimating depth from image sequences using very accurate motion parameters for the observer.
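The uncertainty-weighted fusion that makes the Kalman filter attractive here can be illustrated with a minimal one-dimensional example. This is an illustrative sketch of the general idea, not the algorithms of the cited papers; the function name and noise values are assumptions:

```python
# Minimal 1D Kalman filter measurement update for a static state:
# each measurement carries a variance, and the update weights the
# current estimate and the measurement by their uncertainties.
# Illustrative only; not the formulation of [2], [3], or [4].

def kalman_update(x, p, z, r):
    """Fuse measurement z (variance r) into estimate x (variance p)."""
    k = p / (p + r)          # Kalman gain: relative trust in the measurement
    x_new = x + k * (z - x)  # estimate moves toward the measurement
    p_new = (1.0 - k) * p    # fused variance always shrinks
    return x_new, p_new


# Example: estimating a static depth value from repeated noisy readings,
# starting from an uninformative prior (huge variance).
x, p = 0.0, 1e6
for z in (10.2, 9.8, 10.1):
    x, p = kalman_update(x, p, z, r=0.25)
print(x, p)  # estimate settles near 10 with variance well below 0.25
```

Because the update is closed-form and recursive, each new frame refines the estimate at constant cost, which is what makes the filter suitable for real-time processing of dynamic low-level data.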
Elfes [5] proposed a 2D occupancy grid as a world representation and determined the probability of the