Backtracking: Retrospective multi-target tracking

W.P. Koppen ⇑, M. Worring

Intelligent Systems Lab Amsterdam, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands

Article history: Received 30 August 2010; Accepted 23 April 2012; Available online 18 May 2012

Keywords: Surveillance; Tracking; Forensics

Abstract

We introduce a multi-target tracking algorithm that operates on prerecorded video as typically found in post-incident surveillance camera investigation. Apart from being robust to visual challenges such as occlusion and variation in camera view, our algorithm is also robust to temporal challenges, in particular unknown variation in frame rate. The complication with variation in frame rate is that it invalidates motion estimation. As such, tracking algorithms based on motion models will show decreased performance. On the other hand, appearance based detection in individual frames suffers from a plethora of false detections. Our tracking algorithm, albeit relying on appearance based detection, deals robustly with the caveats of both approaches. The solution rests on the fact that for prerecorded video we can make fully informed choices, based not only on preceding but also on following frames. We start off from an appearance based object detection algorithm that detects all target objects in each frame. From this we build a graph structure. The detections form the graph's nodes, and the edges are formed by connecting each detection in a frame to all detections in the following frame. Thus, each path through the graph represents one particular selection of successive detections. Tracking is then reformulated as a heuristic search for optimal paths, where an optimal path contains all detections belonging to a single object and excludes every other detection. We show that this approach, without an explicit motion model, is robust to both the visual and temporal challenges.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

Surveillance cameras are at present a widespread tool for the observation of large areas, allowing a single officer to monitor multiple locations at once. This live monitoring is mainly used for early intervention and the prevention of street crime. Another use of the video, the one this paper focuses on, is when previously recorded video is retrieved and reviewed for evidence in forensic cases or any other post-incident investigation. In such cases it is common to have large amounts of video data which need to be reviewed in their entirety.

Clearly, the most important subjects of observation are people. If we were able to automatically find and track all recorded persons, it would greatly alleviate the exhausting process of video reviewing, but tracking all persons is a challenging task. One of the main problems of multiple person tracking is that people may occlude each other, in which case it is difficult for a computer to tell them apart. Another major challenge, and one very common in operated surveillance video, is that the camera may pan, tilt, and zoom. Such operations drastically alter the perceptual location of all objects (their xy-position within the frame), and thus trajectories become more chaotic. This effect is amplified by the fact that, mostly to save bandwidth, many cameras use a variable recording frame rate. In other words, the elapsed time between any two successive frames varies. All these aspects raise severe difficulties in the prediction of object location, and thus in tracking persons.

In one of the earliest approaches towards multiple target tracking [24], alternative hypotheses about the configuration of tracks are maintained, hence the name multiple hypothesis tracking (MHT). In MHT, a graph is constructed where each node is a hypothesis (a possible track) and the edges show how a hypothesis can change with the addition of new object detections.
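As an illustration only, the hypothesis graph can be sketched in a few lines of Python. This is a minimal sketch under a simplifying assumption, not the formulation of [24]: a hypothesis is represented as a tuple of detection ids, and each new detection is either appended to a track or left out of it. The names `Hypothesis`, `expand`, and `leaves_after` are hypothetical and chosen for this example.

```python
from typing import List, Tuple

# Assumption for this sketch: a hypothesis (possible track) is the tuple of
# detection ids it contains.
Hypothesis = Tuple[int, ...]


def expand(leaves: List[Hypothesis], detection: int) -> List[Hypothesis]:
    """Split every leaf into two children: without and with the new detection."""
    children: List[Hypothesis] = []
    for track in leaves:
        children.append(track)                  # detection is not part of this track
        children.append(track + (detection,))   # detection is appended to this track
    return children


def leaves_after(detections: List[int]) -> List[Hypothesis]:
    """Return all leaf hypotheses after processing the detections in order."""
    leaves: List[Hypothesis] = [()]  # root of the tree: the empty track
    for detection in detections:
        leaves = expand(leaves, detection)
    return leaves


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, len(leaves_after(list(range(n)))))
```

Calling `leaves_after([0, 1, 2])` returns eight hypotheses, ranging from the empty track `()` to the full track `(0, 1, 2)`.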
Thus, in this graph each hypothesis that is currently not the most likely is still stored for a later moment, with the idea that future information may shed new light on which configuration is actually the most likely. Essentially, this allows the tracker to recover from tracking errors, and it is particularly helpful for dealing with occlusion. However, for each new detection all leaves in the tree-shaped graph need to be split into two new options: the track with the new detection added and the track without it. So with every single new detection the number of leaves doubles, and the tree grows exponentially in the number of detections. For large videos, such as surveillance videos, this process becomes computationally prohibitive.

This paper has been recommended for acceptance by R. Bergevin.
⇑ Corresponding author. E-mail addresses: uva@paulkoppen.com (W.P. Koppen), m.worring@uva.nl (M. Worring).
Computer Vision and Image Understanding 116 (2012) 967–980. http://dx.doi.org/10.1016/j.cviu.2012.04.004
© 2012 Elsevier Inc. All rights reserved.

To alleviate the situation some works try to reduce the temporal context or prune