Tyzx DeepSea High Speed Stereo Vision System

John Iselin Woodfill, Gaile Gordon, Ron Buck
Tyzx, Inc.
3885 Bohannon Drive, Menlo Park, CA 94025
{Woodfill, Gaile, Ron}@tyzx.com

Abstract

This paper describes the DeepSea Stereo Vision System, which makes the use of high speed 3D images practical in many application domains. The system is based on the DeepSea processor, an ASIC that computes absolute depth from simultaneously captured left and right images at high frame rates, with low latency and low power. The chip is capable of running at 200 frames per second on 512x480 images, with only 13 scan lines of latency between data input and first depth output. The DeepSea Stereo Vision System includes a stereo camera, onboard image rectification, and an interface to a general purpose processor over a PCI bus. We conclude by describing several applications implemented with the DeepSea system, including person tracking, obstacle detection for autonomous navigation, and gesture recognition.

1. Introduction

Many image processing applications require, or are greatly simplified by, the availability of 3D data. This rich data source provides direct absolute measurements of the scene. Object segmentation is simplified because discontinuities in depth measurements generally coincide with object borders. Simple transforms of the 3D data can also provide alternative virtual viewpoints of the data, simplifying analysis for some applications.

Stereo depth computation, in particular, has many advantages over other 3D sensing methods. First, stereo is a passive sensing method. Active sensors, which rely on the projection of some signal into the scene, often pose high power requirements or safety issues under certain operating conditions. They are also detectable, an issue in security or defense applications. Second, stereo sensing provides a color or monochrome image which is inherently registered to the depth image. This image is valuable in image analysis, either using traditional 2D methods or novel methods that combine color and depth image data. Third, the operating range and Z resolution of stereo sensors are flexible because they are simple functions of lens field-of-view, lens separation, and image size. Almost any operating parameters are possible with an appropriate camera configuration, without requiring any changes to the underlying stereo computation engine. Fourth, stereo sensors have no moving parts, an advantage for reliability.

High frame rate and low latency are critical for applications that must make quick decisions based on events in the scene. Tracking moving objects from frame to frame is simpler at higher frame rates because relative motion between frames is smaller, creating less tracking ambiguity. In autonomous navigation applications, vehicle speed is limited by the speed of the sensors used to detect moving obstacles. A vehicle traveling at 60 mph covers 88 feet in a second. An effective navigation system must monitor the vehicle path for new obstacles many times within those 88 feet to avoid collisions. It is also critical to capture 3D descriptions of potential obstacles to evaluate their location and trajectory relative to the vehicle path, and whether their size represents a threat to the vehicle. In safety applications such as airbag deployment, the 3D position of vehicle occupants must be understood to determine whether an airbag can be safely deployed, a decision that must be made within tens of milliseconds.

Computing depth from two images is a computationally intensive task. It involves finding, for every pixel in the left image, the corresponding pixel in the right image. The correct corresponding pixel is the one representing the same physical point in the scene. The distance between two corresponding pixels in image coordinates is called the disparity, which is inversely proportional to distance.
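The inverse relation between disparity and distance follows from similar triangles in a rectified stereo pair: a point at depth Z, viewed through lenses with focal length f (in pixels) separated by baseline b, appears with disparity d = f*b/Z, so Z = f*b/d. A minimal sketch of this relation follows; the focal length, baseline, and disparity values are illustrative choices, not DeepSea calibration parameters.

```python
def depth_from_disparity(f_pixels: float, baseline_m: float,
                         disparity_px: float) -> float:
    """Depth in meters for a rectified pair: Z = f * b / d.

    f_pixels:     focal length expressed in pixels
    baseline_m:   lens separation in meters
    disparity_px: horizontal shift between left and right views, in pixels
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return f_pixels * baseline_m / disparity_px

# Illustrative configuration: 500-pixel focal length, 22 cm baseline.
f, b = 500.0, 0.22
near = depth_from_disparity(f, b, 50.0)  # large disparity -> near point
far = depth_from_disparity(f, b, 5.0)    # small disparity -> far point
print(near, far)  # 2.2 m vs. 22.0 m: disparity shrinks 10x as depth grows 10x
```

Note how the operating range tracks the configuration, as the text observes: increasing f (narrower field of view) or b (wider lens separation) increases the disparity at a given depth, extending the usable range without changing the matching computation itself.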
In other words, the nearer a point is to the sensor, the more it will appear to shift between the left and right views. In dense stereo depth computation, finding a pixel's corresponding pixel in the other image requires searching a range of pixels for a match. As image size, and therefore pixel density, increases, the number of pixel locations searched must increase to retain the same operating range. Therefore, for an NxN image, the stereo computation is approximately O(N^3).

Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW'04) 1063-6919/04 $20.00 IEEE
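The cubic growth can be illustrated by counting per-frame match operations: every one of the N x N pixels evaluates D candidate disparities, and D must scale with N to preserve the operating range. The resolutions and search ranges below are illustrative only, not DeepSea's actual parameters.

```python
def match_operations(width: int, height: int, disparities: int) -> int:
    """Candidate-match evaluations for one dense disparity frame."""
    return width * height * disparities

# Doubling image size while keeping the same operating range also
# doubles the disparity search range, so the work grows by ~2^3.
base = match_operations(256, 256, 32)
doubled = match_operations(512, 512, 64)
print(doubled / base)  # 8.0
```

At video rates this count is multiplied again by the frame rate, which is why dense stereo at 200 frames per second calls for dedicated hardware such as the DeepSea ASIC rather than a general purpose processor.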