Mutual information based registration of multimodal stereo videos for person tracking

Stephen J. Krotosky, Mohan M. Trivedi
Computer Vision and Robotics Research Laboratory, University of California, San Diego, 9500 Gilman Dr. 0434, La Jolla, CA 92093-0434, USA

Received 15 September 2006; accepted 23 October 2006
Available online 20 December 2006
Communicated by James Davis and Riad Hammoud

Abstract

Research presented in this paper deals with the systematic examination, development, and evaluation of a novel multimodal registration approach that can perform accurately and robustly for relatively close range surveillance applications. An analysis of multimodal image registration gives insight into the limitations of assumptions made in current approaches and motivates the methodology of the developed algorithm. Using calibrated stereo imagery, we employ maximization of mutual information in sliding correspondence windows that inform a disparity voting algorithm to demonstrate successful registration of objects in color and thermal imagery. Extensive evaluation of scenes with multiple objects at different depths and levels of occlusion shows high rates of successful registration. Ground truth experiments demonstrate the utility of the disparity voting techniques for multimodal registration by yielding qualitative and quantitative results that outperform approaches that do not consider occlusions. A basic framework for multimodal stereo tracking is investigated and promising experimental studies show the viability of using registration disparity estimates as a tracking feature.

© 2007 Elsevier Inc. All rights reserved.

Keywords: Thermal infrared sensing; Multisensor fusion; Person tracking; Visual surveillance; Situational awareness

1. Introduction

A fundamental issue associated with multisensory vision is that of accurately registering corresponding information and features from the different sensory systems.
This issue is exacerbated when the sensors are capturing signals derived from totally different physical phenomena, such as color (reflected energy) and thermal signature (emitted energy). Multimodal imagery applications for human analysis span a variety of application domains, including medical [1], in-vehicle safety systems [2] and long-range surveillance [3]. The combination of both types of imagery yields information about the scene that is rich in color, depth, motion and thermal detail. Once registered, such information can then be used to successfully detect, track and analyze movement and activity patterns of persons and objects in the scene.

At the heart of any registration approach is the selection of the most relevant similarity metric, one that can accurately match the disparate physical properties manifested in images recorded by multimodal cameras. Mutual information (MI) provides an attractive metric for situations where there are complex mappings of the pixel intensities of corresponding objects in each modality, due to the disparate physical mechanisms that give rise to the multimodal imagery [4]. Egnal has shown that mutual information is a viable similarity metric for multimodal stereo registration when the mutual information window sizes are large enough to sufficiently populate the joint probability histogram of the mutual information computation [5]. Further investigation into the properties and applicability of mutual information as a windowed correspondence measure has been carried out by Thevenaz and Unser [6]. Challenges lie in obtaining these appropriately sized window regions for

1077-3142/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.cviu.2006.10.008

This research is sponsored by the Technical Support Working Group (TSWG) for Combating Terrorism, DHS and the U.C. Discovery Grant.

Corresponding author. E-mail addresses: krotosky@ucsd.edu (S.J. Krotosky), mtrivedi@ucsd.edu (M.M. Trivedi).
Computer Vision and Image Understanding 106 (2007) 270–287
www.elsevier.com/locate/cviu
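The windowed mutual information correspondence measure discussed in the introduction can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `mutual_information` estimates MI from a joint intensity histogram of two equally sized patches, and the hypothetical helper `best_disparity` slides a window along the horizontal epipolar line and returns the shift that maximizes MI, the basic operation underlying the paper's sliding correspondence windows. Window size, bin count, and search range are assumptions chosen for the sketch.

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    """MI between two equally sized patches, estimated from their
    joint intensity histogram (a sparse histogram biases MI upward,
    hence the paper's note that windows must be large enough)."""
    hist, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = hist / hist.sum()                  # joint probability
    px = pxy.sum(axis=1, keepdims=True)      # marginal of patch_a
    py = pxy.sum(axis=0, keepdims=True)      # marginal of patch_b
    nz = pxy > 0                             # skip empty cells: 0 * log 0 = 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def best_disparity(img_a, img_b, x, win, max_disp):
    """Slide a win x win window along the row-rectified epipolar line in
    img_b; return the disparity maximizing MI with the window at x in img_a."""
    ref = img_a[:, x:x + win]
    scores = []
    for d in range(max_disp + 1):
        if x + d + win > img_b.shape[1]:
            break
        scores.append(mutual_information(ref, img_b[:, x + d:x + d + win]))
    return int(np.argmax(scores))
```

In the paper's disparity voting scheme, such per-window disparity estimates are accumulated across overlapping windows so that each pixel receives the disparity supported by the most votes, which is what makes the method robust to occlusions.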