International Journal of Computer Vision, 3, 181-208 (1989)
© 1989 Kluwer Academic Publishers. Manufactured in The Netherlands.

From Image Sequences to Recognized Moving Polyhedral Objects

DAVID W. MURRAY, DAVID A. CASTELOW AND BERNARD F. BUXTON
Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, UK

Abstract

This paper describes the combination of several novel algorithms into a system that obtains visual motion from a sequence of images and uses it to recover a three-dimensional description of the motion and geometry of the scene in terms of moving extended straight edges. The system goes on to recognize the recovered geometry as an object from a database of wireframe models, a stage that also resolves the depth/speed scaling ambiguity inherent in visual motion processing, resulting in absolute depth and motion recovery. The processing sequence is demonstrated on imagery from a well-carpentered CSG model and on natural imagery of simple polyhedral objects.

1 Introduction

The primary aim of research into computational vision, and indeed of that into many other automated sensing techniques, is to give machines the power to perceive the three-dimensional nature of the environment in which they are required to take intelligent action. More often than not, action involves movement, and so the recovery of three-dimensional motion at a low level of the sensory processing is of great importance in robotics. The two-dimensional visual motion derived from a sequence of time-varying images is one valuable source of information about the 3D scene and its motion relative to the sensor. At the most basic level, visual motion can be used simply to flag scene motion, but it has long been appreciated, certainly since the work of von Helmholtz [1], that encoded within it is much more detail about the 3D geometric structure and 3D motion of the scene.
The capability to exploit this in the human visual system has been demonstrated in a variety of psychophysical experiments over many years [2,3,4], but it is only recently that computing resources have been sufficient to spur the derivation of detailed computational schemes to do the same, that is, to solve the structure-from-motion problem. In reality, the notion of one structure-from-motion problem with one solution is quite erroneous. The growing literature on motion processing explores a plethora of problems and solutions, and all show that, with the extra degrees of freedom introduced by unconstrained relative motion between camera and viewed scene, obtaining reliable structure from image motion is difficult. There are several contributory factors. In the first place, obtaining visual motion itself from a sequence of images is prone to error. Secondly, structure-from-motion algorithms are notoriously ill-conditioned with respect to such errors in the visual motion, and thus demand both high-quality visual motion and a high-quality segmentation of the visual motion. Yet another pitfall is that the visual motion field computed from the imagery may not relate simply to the geometric motion field (i.e., the projected scene motion) because of lighting and occlusion effects [5,6]. Broadly speaking, these difficulties are emphasized when gradient-based methods are used to compute visual motion (e.g., [7,8,9]) and 3D surface structure is recovered, whether that of planar facets [10,11,12] or curved surfaces [13,14,15]. By contrast, the difficulties appear alleviated when token-matching schemes are used to compute visual motion and 3D point structure is recovered. Two recent highly successful schemes have used image corners as the matching tokens [16,17]. Unfortunately, the use of distinctive tokens can often result in rather sparse scene descriptions.
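The depth/speed scaling ambiguity noted in the abstract can be made concrete with a short numerical sketch (ours, not part of the system described in this paper): under perspective projection, scaling both the scene geometry and its translational velocity by a common factor leaves the image motion field unchanged, so visual motion alone determines depth and speed only up to scale. The point and velocity values below are arbitrary illustrative choices, and rotation is omitted for simplicity.

```python
import numpy as np

def image_velocity(P, V, f=1.0):
    """Image velocity of a scene point P = (X, Y, Z) moving with
    translational velocity V = (Vx, Vy, Vz), under perspective
    projection x = f*X/Z, y = f*Y/Z (pure translation, no rotation)."""
    X, Y, Z = P
    Vx, Vy, Vz = V
    x, y = f * X / Z, f * Y / Z          # projected image position
    u = (f * Vx - x * Vz) / Z            # d/dt of x
    v = (f * Vy - y * Vz) / Z            # d/dt of y
    return np.array([u, v])

P = np.array([0.5, -0.3, 4.0])           # hypothetical scene point
V = np.array([0.1, 0.0, 0.2])            # hypothetical velocity

# Scaling depth and speed together leaves the image motion unchanged.
for lam in (1.0, 2.0, 10.0):
    print(image_velocity(lam * P, lam * V))
```

All three scaled configurations produce identical image velocities, which is why the recognition stage, by matching to models of known absolute size, can fix the overall scale.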
In this paper, we describe the combination of algorithms into a system (ISOR) which obtains visual motion from an image sequence and uses it to recover a three-dimensional description of the motion and geometry of the scene in terms of moving extended edges. The system goes on, where possible, to recognize the recovered geometry as an object from a database of 3D wireframe models. As we will discuss again in our conclusions, the edge-based scheme appears to lie, in terms of both simplicity and richness of scene description, somewhere between schemes based on corners and those based on surfaces. In this work we choose to restrict processing to straight extended edges,