Efficient Visual Object Tracking with Online Nearest Neighbor Classifier

Steve Gu and Ying Zheng and Carlo Tomasi
Department of Computer Science, Duke University

Abstract. A tracking-by-detection framework is proposed that combines nearest-neighbor classification of bags of features, efficient subwindow search, and a novel feature selection and pruning method to achieve stability and plasticity in tracking targets of changing appearance. Experiments show that near-frame-rate performance is achieved (sans feature detection), and that the state of the art is improved in terms of handling occlusions, clutter, changes of scale, and of appearance. A theoretical analysis shows why nearest neighbor works better than more sophisticated classifiers in the context of tracking.

1 Introduction

Visual object tracking is crucial to visual understanding in general, and to many computer vision applications ranging from surveillance and robotics to gesture and motion recognition. The state of the art has advanced significantly in the past 30 years [1–8]. Recently, advances in apparently unrelated areas have given tracking a fresh impulse. Specifically, progress in the definition of features invariant to various imaging transformations [9, 10], online learning [11, 12], and object detection [13–16] has spawned the approach of tracking by detection [17–21], in which a target object identified by the user in the first frame is described by a set of features. A separate set of features describes the background, and a binary classifier separates target from background in successive frames. To handle appearance changes, the classifier is updated incrementally over time. Motion constraints restrict the space of boxes to be searched for the target. In a recent example of this approach, Babenko et al.
[20] adapt Multiple Instance Learning (MIL) [11, 12] by building an evolving boosting classifier that tracks bags of image patches, and report excellent tracking results on challenging video sequences. The main advantages of tracking by detection come from the flexibility and resilience of its underlying representation of appearance. Several parametric learning techniques such as Support Vector Machines (SVM, [22]), boosting [20], generative models [23], and fragments [24] have been used successfully in tracking by detection. More recently, Santner et al. propose a sophisticated tracking system called PROST [21] that achieves top performance with a smart combination of three trackers: template matching based on normalized cross correlation, mean shift optical flow [25], and online random forests [26] to predict the target location.
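
The generic tracking-by-detection loop described earlier in this section (target features, background features, a binary classifier, a restricted set of candidate boxes, and an incremental update) can be illustrated with a small sketch. The code below is an assumption-laden toy, not the method of this paper or of any cited tracker: the class name `NearestNeighborTracker`, its methods, the squared Euclidean distance, and the "count of target-labeled features" box score are all hypothetical choices made only for illustration, and features are plain tuples of floats.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class NearestNeighborTracker:
    """Toy 1-nearest-neighbor target/background classifier (hypothetical)."""

    def __init__(self, target_feats, background_feats):
        # Features taken from the user-selected box in the first frame,
        # and features taken from outside that box.
        self.target = list(target_feats)
        self.background = list(background_feats)

    def is_target(self, feat):
        # Label a feature "target" if its nearest target feature is
        # closer than its nearest background feature.
        d_target = min(sq_dist(feat, t) for t in self.target)
        d_background = min(sq_dist(feat, b) for b in self.background)
        return d_target < d_background

    def score_box(self, feats_in_box):
        # Assumed scoring rule: number of features labeled as target.
        return sum(1 for f in feats_in_box if self.is_target(f))

    def track(self, candidate_boxes):
        # candidate_boxes maps a box (x, y, w, h) to the features inside it.
        # Motion constraints would decide which boxes appear here.
        return max(candidate_boxes,
                   key=lambda box: self.score_box(candidate_boxes[box]))

    def update(self, feats_in_box):
        # Incremental update: absorb features currently labeled as target.
        # Labels are computed before the model is modified.
        accepted = [f for f in feats_in_box if self.is_target(f)]
        self.target.extend(accepted)


# Usage with made-up two-dimensional features.
tracker = NearestNeighborTracker(
    target_feats=[(0.0, 0.0), (1.0, 0.0)],
    background_feats=[(10.0, 10.0)],
)
boxes = {
    (0, 0, 5, 5): [(0.5, 0.1), (9.0, 9.0)],
    (5, 5, 5, 5): [(9.5, 10.0), (11.0, 10.0)],
}
best = tracker.track(boxes)
tracker.update(boxes[best])
```

In this sketch the classifier has no parameters to fit: the model is just the two stored feature sets, so the update step amounts to adding features. A practical tracker would also need to prune stored features and to search boxes efficiently, neither of which is attempted here.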