Registration of IR and EO Video Sequences Based on Frame Difference

Zheng Liu and Robert Laganière
School of Information Technology and Engineering
University of Ottawa
800 King Edward Ave., Ottawa, ON K1N 6N5, Canada
E-mail: laganier@site.uottawa.ca

Abstract

Multi-modal imaging sensors have been employed in advanced surveillance systems in recent years. The performance of surveillance systems can be enhanced by using information beyond the visible spectrum, for example, infrared imaging. To ensure the correctness of low- and high-level processing, multi-modal imagers must be fully calibrated or registered. In this paper, an algorithm is proposed to register the video sequences acquired by an infrared and an electro-optical (CCD) camera. The registration method is based on the silhouette extracted by differencing adjacent frames; the difference is computed with an image structural similarity measure. Initial registration is implemented by tracking the head-top points in consecutive frames. Finally, an optimization procedure that maximizes mutual information is employed to refine the registration results.

1. Introduction

Vision systems working beyond the visible spectrum are becoming affordable assets for advanced surveillance systems. The performance of these systems can be enhanced by taking full advantage of the information available across the electromagnetic spectrum. This makes a surveillance system more robust and reliable under difficult conditions, such as a noisy and cluttered background, poor lighting, smoke, and fog. The technique for achieving this goal is known as information or sensor fusion. Depending on the requirements, the fusion of multi-modal images can be implemented at different levels using various fusion algorithms [1, 4].

The infrared (IR) camera uses a thermal detector to measure the difference in infrared radiation between objects, i.e., the variance of their thermal emissivity properties. The electro-optical (EO) sensor, e.g.
CCD or CMOS cameras, captures the reflective light properties of objects [6]. Therefore, visual and IR imagery provide complementary information about the scene [6]. Multiple cues provided by the two imaging modalities can be used for detection, tracking, and content analysis in surveillance applications. However, prior to any further processing, the EO and IR images from the video sequences should be registered so that corresponding pixels in the two images are associated with the same physical points in the scene. This ensures the correctness of pixel- and high-level processing.

Image registration consists of four basic steps: feature detection, feature matching, mapping function design, and image transformation and resampling [12]. Li et al. registered multi-sensor images with image contours [7]. In another publication [8], Li et al. used a wavelet-based approach to detect image contours and located feature points using local statistics of image intensity. The feature points were matched with a normalized correlation method. Coiras et al. matched the triangles formed by grouping straight line segments extracted from the IR and EO images [3]. However, the physical correspondences may not be fully detected with matchable contours or lines, as the same scene may appear totally different in the two image modalities. Han et al. suggested using the silhouette of a moving human body to register IR and EO images. They found the silhouette by classifying each pixel as belonging to either foreground or background based on a Gaussian background distribution [5]. The centroid and head-top points in two pairs of images were used as control points, and a genetic algorithm was employed to minimize the registration error function. In [11], Ye et al. proposed using zero-order statistics to detect moving objects in a video sequence. By tracking the feature points, an iterative registration algorithm is implemented.
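The silhouette-by-frame-difference idea outlined in the abstract can be sketched as follows. This is an illustrative simplification, not the authors' exact procedure: it uses a per-pixel structural similarity (SSIM) map with a plain box window instead of the usual Gaussian weighting, and the window radius and the 0.5 decision threshold are assumptions chosen for the example.

```python
import numpy as np

def box_mean(img, r):
    """Local mean over a (2r+1)x(2r+1) window, computed with an integral image."""
    s = 2 * r + 1
    pad = np.pad(img.astype(np.float64), r, mode='edge')
    ii = np.zeros((pad.shape[0] + 1, pad.shape[1] + 1))
    ii[1:, 1:] = pad.cumsum(0).cumsum(1)
    win = ii[s:, s:] - ii[:-s, s:] - ii[s:, :-s] + ii[:-s, :-s]
    return win / s ** 2

def ssim_map(a, b, r=3, c1=6.5025, c2=58.5225):
    """Per-pixel SSIM between two grayscale frames (c1, c2 are the
    standard stabilizing constants for 8-bit intensity range)."""
    mu_a, mu_b = box_mean(a, r), box_mean(b, r)
    var_a = box_mean(a * a, r) - mu_a ** 2
    var_b = box_mean(b * b, r) - mu_b ** 2
    cov = box_mean(a * b, r) - mu_a * mu_b
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def frame_diff_silhouette(f1, f2, r=3, thresh=0.5):
    """Mark pixels whose local structure changed between adjacent frames.
    Low SSIM = structural change (moving silhouette); the threshold is arbitrary."""
    return ssim_map(f1, f2, r) < thresh

# Toy example: a bright 10x10 "person" moves 10 pixels to the right.
f1 = np.zeros((64, 64)); f1[20:30, 20:30] = 200.0
f2 = np.zeros((64, 64)); f2[20:30, 30:40] = 200.0
mask = frame_diff_silhouette(f1, f2)
```

In the toy example the mask fires inside both the vacated and the newly occupied regions and stays off in the static background; the resulting silhouette would then feed the head-top point tracking described in the abstract.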
Related work was also reported by Maes et al. and Chen et al. in [9, 2], where the registration is carried out by maximizing the mutual information of two image regions. However, in [2] the images