High-Speed Pose and Velocity Measurement from Vision

Redwan Dahmouche, Omar Ait-Aider, Nicolas Andreff and Youcef Mezouar
LASMEA - CNRS - Université Blaise Pascal, 63175 Aubière, France
{firstname.lastname}@lasmea.univ-bpclermont.fr

Abstract— This paper presents a novel method for high-speed pose and velocity computation from a visual sensor. The main problem in high-speed vision is the bottleneck phenomenon, which limits the video transmission rate. The proposed approach circumvents this problem by increasing the information density instead of the data transmission rate. The strategy is based on a rotary sequential acquisition of selected regions of interest (ROIs), which provides space-time data. This acquisition mode induces a deformation of the image projection of dynamic objects. This paper shows how to exploit this artifact to measure pose and velocity simultaneously, at the same frequency as the ROI acquisition.

I. INTRODUCTION

Vision is used at several levels in robotics, particularly in localization, identification [1] and control [2], [3]. However, the slow rate of video sensors is an evident drawback in high-sampling-frequency applications. Indeed, the video rate of standard high-speed cameras is about 120 Hz, while high-speed dynamic control applications typically run at 1 kHz. Nevertheless, it has been reported that high-speed vision can be used in the dynamic control of serial robots [4], where a Generalized Predictive Control (GPC) scheme was combined with a linearisation of the visual loop to adapt the video rate (120 Hz) to the control sampling frequency (500 Hz). However, this solution increases control complexity. An alternative solution is to increase the video rate until it reaches the system sampling frequency. To do so, different approaches have been presented in the literature. Usually, the camera video rate is limited by the bandwidth of the transmission interface.
Reducing the image resolution to decrease the video flow considerably narrows the field of view of the camera for a given accuracy of the end-effector pose estimation. To solve this problem, different approaches are possible, such as increasing the video rate through more efficient video compression [5], creating faster transmission interfaces (for instance, CamLink), or embedding the signal processing close to the acquisition system [6], [7]. Nevertheless, we believe that the optimal solution is to increase the information density of the video flow. Indeed, the current approach in vision-based applications is to grab and transmit the whole image, extract the interesting features to process, and throw away the rest of the image. For instance, to provide vision-based pose estimation of a moving object from a single image, four non-degenerate point projections are enough [8]. The ratio between the amount of data needed to perform the pose estimation and the transmitted flow for an acquired image of size S is given by (4 × 2 × precision size) / (S × unsigned char size). For a mega-pixel image, the ratio is 6.4 × 10^-5: the transmitted data is more than 1.5 × 10^4 times larger than the needed amount. Instead of transmitting the whole image and then selecting regions of interest (ROIs), it is more interesting, from the data-flow and 'silicon cost' points of view, to invert the process: first select the ROI positions, then transmit only those regions. Note that in both cases the ROI positions are predicted, so there is no difference between the two approaches if the rest of the image is not used. This acquisition mode was proposed in [9], where a new CMOS camera was designed to grab multiple ROIs simultaneously. The same approach can be performed using an "off-the-shelf" fast reconfigurable CMOS camera which uses the CamLink interface.

This work was supported by Région d'Auvergne through the Innov@pôle project and by the European Union through the Integrated Project NEXT no. 0011815.
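As a sanity check on the ratio above, the arithmetic can be sketched as follows (the byte sizes are our assumption: 8-byte double-precision coordinates and 1-byte unsigned char pixels, which reproduce the figures quoted in the text):

```python
# Data actually needed for pose estimation: 4 points x 2 image coordinates,
# assuming double precision (8 bytes per coordinate).
needed_bytes = 4 * 2 * 8

# Data transmitted for a mega-pixel image, 1 byte (unsigned char) per pixel.
image_bytes = 1_000_000 * 1

ratio = needed_bytes / image_bytes
print(ratio)                        # 6.4e-05
print(image_bytes / needed_bytes)   # 15625.0, i.e. more than 1.5e4 times
```

With these assumed word sizes, the 6.4 × 10^-5 ratio and the ~1.5 × 10^4 oversupply factor quoted in the text both fall out directly.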
A single rectangular area can be selected for shuttering and transmission, and its parameters can be changed dynamically at each acquisition. By grabbing only the areas of the scene that contain information, such as interest points or blobs (Figure 1), the information density of the video flow is increased. The direct effect is that the ROI acquisition frequency can be multiplied by the ratio of the full image size to the size of the grabbed area. For instance, transmitting ten regions of interest of 10 × 10 pixels that contain the desired information, instead of the full 1024 × 1024 pixel image, reduces the data flow from 1M pixels to 1K pixels and theoretically multiplies the acquisition frequency by 1000. In practice, transmission control bits, parameter setting and exposure time limit the video rate. Note that the exposure time and the acquisition frequency can also be controlled.

Unfortunately, sequential acquisition of partial areas on the retina introduces time delays between acquisitions and affects the image projection of moving objects. Thus, classical pose estimation algorithms cannot be used in this case. In addition, these methods only enable the estimation of successive poses. The velocity information is generally retrieved by numerically differentiating the pose measurements, which introduces additional noise. To compute pose and velocity at each sample, one approach consists in using data fusion methods on a set of partial time-varying information (e.g., Kalman filtering [6]). However, this approach assumes Gaussian noise, which is not guaranteed in pose measurement applications. To

2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, May 19-23, 2008. 978-1-4244-1647-9/08/$25.00 ©2008 IEEE.
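The theoretical acquisition speed-up from sequential ROI grabbing described above can be sketched with the pixel counts from the text (real cameras lose part of this gain to control bits, parameter setting and exposure time):

```python
# Full-frame acquisition: 1024 x 1024 pixels per frame.
full_frame_pixels = 1024 * 1024

# Sequential ROI acquisition: ten regions of interest of 10 x 10 pixels.
roi_pixels = 10 * (10 * 10)

# Theoretical multiplier on the acquisition frequency, ignoring
# transmission overhead and exposure-time limits.
speedup = full_frame_pixels / roi_pixels
print(speedup)  # 1048.576, roughly the factor of 1000 quoted in the text
```

The exact ratio is slightly above 1000 because a "mega-pixel" frame here is 1024 × 1024 = 1 048 576 pixels rather than an even million.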