Pedestrian Association and Localization in Monocular FIR Video Sequence

Mayank Bansal, Shunguang Wu, Jayan Eledath
Sarnoff Corporation
201 Washington Rd, Princeton, NJ, USA
{mbansal,swu,jeledath}@sarnoff.com

Abstract

This paper addresses the frame-to-frame data association and state estimation problems in localization of a pedestrian relative to a moving vehicle from a monocular far infra-red video sequence. Using a novel application of the hierarchical model-based motion estimation framework, we are able to use the image appearance information to solve the frame-to-frame data association problem and estimate a sub-pixel accurate height ratio for a pedestrian in two frames. Then, to localize the pedestrian, we propose a novel approach of using the pedestrian height ratio estimates to guide an interacting multiple-hypothesis-mode/height filtering algorithm instead of using a constant pedestrian height model. Experiments on several IR sequences demonstrate that this approach achieves results comparable to those from a known pedestrian height thus avoiding errors from a constant height model based approach.

1. Introduction

In recent years, there has been an increased use of visual sensors in automotive safety and convenience applications. One important safety application is to detect pedestrians [17] at night time. Visible-range cameras do not provide sufficient contrast to detect pedestrians well - a problem which is well handled by near and far infra-red (NIR, FIR) cameras. FIR cameras carry the advantage of target heat sensitivity without the need for active ambient illumination. The images of vehicles, pedestrians and animals are significantly enhanced and are clearly visible under otherwise poor visibility conditions. Accurately estimating the 3D location of the pedestrian relative to the moving vehicle is important for accurate warnings.
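To make the role of pedestrian height in monocular localization concrete, the following is a minimal sketch under assumptions that are not stated in the paper: an ideal pinhole camera, an upright pedestrian, and a range estimate of the form Z = f * H / h. The focal length, heights, and ranges are hypothetical values chosen for illustration, and the function names are ours. The sketch shows that a wrong constant-height assumption biases the range proportionally, while the image height ratio between two frames does not depend on the assumed height.

```python
# Illustrative sketch only: ideal pinhole camera, upright pedestrian.
# All numeric values below are hypothetical, not taken from the paper.

def range_from_height(focal_px: float, height_m: float, image_height_px: float) -> float:
    """Range Z (metres) from the pinhole relation Z = f * H / h."""
    if image_height_px <= 0:
        raise ValueError("image height must be positive")
    return focal_px * height_m / image_height_px


def image_height(focal_px: float, height_m: float, range_m: float) -> float:
    """Projected pedestrian height h (pixels) from h = f * H / Z."""
    if range_m <= 0:
        raise ValueError("range must be positive")
    return focal_px * height_m / range_m


FOCAL_PX = 800.0        # hypothetical focal length in pixels
TRUE_HEIGHT_M = 1.5     # hypothetical true pedestrian height
ASSUMED_HEIGHT_M = 1.8  # hypothetical constant-height model

# A pedestrian that appears 60 pixels tall in the image.
observed_px = 60.0
true_range = range_from_height(FOCAL_PX, TRUE_HEIGHT_M, observed_px)
assumed_range = range_from_height(FOCAL_PX, ASSUMED_HEIGHT_M, observed_px)

# The relative range error equals the relative height error.
relative_error = (assumed_range - true_range) / true_range

# Image height ratio between two frames as the range closes from 20 m to 16 m.
# The focal length and the height cancel, so the ratio is the inverse range ratio.
h_prev = image_height(FOCAL_PX, TRUE_HEIGHT_M, 20.0)
h_curr = image_height(FOCAL_PX, TRUE_HEIGHT_M, 16.0)
scale = h_curr / h_prev

print(f"true range {true_range:.1f} m, constant-height range {assumed_range:.1f} m")
print(f"relative range error {relative_error:.2f}, frame-to-frame scale {scale:.2f}")
```

Under these assumptions the height ratio carries information about relative range change without fixing an absolute height, which is one way to read the motivation for using it in place of a single constant-height model.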
This is a challenging problem because the system has to rely on temporal tracking to estimate the location: both frame-to-frame data association and state-estimation filtering become important. In this paper, we will focus on the data-association and state-estimation aspects.

In FIR imagery, the appearance of a pedestrian does not change much from frame to frame, so it becomes possible to match a pedestrian across time. This temporal image-based matching approach helps the tracker by a) reducing the state-space, and hence the complexity of the filter, because the filter does not need to maintain an appearance model, b) providing an alternate, more robust means for data association in case of missed detections and c) explicitly estimating a sub-pixel object size ratio (which we call scale) in the image between two frames. In this paper, we describe an application of the hierarchical model-based motion estimation paradigm of [4] to match pedestrian appearance over time without explicitly modeling the pedestrian shape. The appearance matching is used, first, to resolve the frame-to-frame association of the detections and then, to estimate the scale across time, which allows a multiple-hypothesis-mode filtering algorithm to be employed for the state-estimation phase.

To obtain a more accurate 3D localization, instead of using a constant pedestrian height H (one mode) for all pedestrians, this paper presents a multiple-hypothesis-mode filtering algorithm where each mode assumes a potential discrete height value for the pedestrian and runs as a separate filter. The probability of each filter is obtained by evaluating the likelihood value of an estimated pedestrian scale relative to the measured scale from the appearance matcher. The final pedestrian location can be obtained either by combining the mode estimates together or by choosing the one with the highest likelihood value.

Related Work.
Gandhi et al. [8] have given a comprehensive survey of recent research on pedestrian collision avoidance systems. The paper reviews various approaches
978-1-4244-3993-5/09/$25.00 ©2009 IEEE