Efficient Visual Object Tracking with Online Nearest Neighbor Classifier

Steve Gu, Ying Zheng, and Carlo Tomasi
Department of Computer Science, Duke University

Abstract. A tracking-by-detection framework is proposed that combines nearest-neighbor classification of bags of features, efficient subwindow search, and a novel feature selection and pruning method to achieve stability and plasticity in tracking targets of changing appearance. Experiments show that near-frame-rate performance is achieved (excluding feature detection), and that the state of the art is improved in terms of handling occlusions, clutter, and changes of scale and appearance. A theoretical analysis shows why nearest neighbor works better than more sophisticated classifiers in the context of tracking.

1 Introduction

Visual object tracking is crucial to visual understanding in general, and to many computer vision applications ranging from surveillance and robotics to gesture and motion recognition. The state of the art has advanced significantly in the past 30 years [1–8]. Recently, advances in apparently unrelated areas have given tracking a fresh impulse. Specifically, progress in the definition of features invariant to various imaging transformations [9, 10], online learning [11, 12], and object detection [13–16] has spawned the approach of tracking by detection [17–21], in which a target object identified by the user in the first frame is described by a set of features. A separate set of features describes the background, and a binary classifier separates target from background in successive frames. To handle appearance changes, the classifier is updated incrementally over time. Motion constraints restrict the space of boxes to be searched for the target. In a recent example of this approach, Babenko et al.
[20] adapt Multiple Instance Learning (MIL) [11, 12] by building an evolving boosting classifier that tracks bags of image patches, and report excellent tracking results on challenging video sequences. The main advantages of tracking by detection come from the flexibility and resilience of its underlying representation of appearance. Several parametric learning techniques such as Support Vector Machines (SVM, [22]), boosting [20], generative models [23], and fragments [24] have been used successfully in tracking by detection. More recently, Santner et al. propose a sophisticated tracking system called PROST [21] that achieves top performance with a smart combination of three trackers: template matching based on normalized cross-correlation, mean shift optical flow [25], and online random forests [26] to predict the target location.
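The generic tracking-by-detection loop described above (label features as target or background, then search nearby boxes for the best-supported one) can be illustrated with a minimal sketch. It is not the implementation proposed in this paper: the function names `nn_label`, `box_score`, and `track_step`, the Euclidean distance, the +1/-1 box score, and the parameters `radius` and `step` are all assumptions made for illustration, and a brute-force search over shifted boxes stands in for efficient subwindow search. The online model update is omitted.

```python
import numpy as np


def nn_label(query, target_feats, background_feats):
    """Label each query descriptor by its nearest neighbor.

    Returns a boolean array that is True where a descriptor is strictly
    closer (squared Euclidean distance, an assumption) to the target set
    than to the background set.
    """
    def min_dist(a, b):
        # Pairwise squared distances between rows of a and rows of b.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1)

    return min_dist(query, target_feats) < min_dist(query, background_feats)


def box_score(points, labels, box):
    """Score a box (x0, y0, x1, y1): +1 per target feature, -1 per
    background feature whose location falls inside the half-open box."""
    x0, y0, x1, y1 = box
    inside = (
        (points[:, 0] >= x0) & (points[:, 0] < x1)
        & (points[:, 1] >= y0) & (points[:, 1] < y1)
    )
    return int(np.sum(np.where(labels[inside], 1, -1)))


def track_step(points, labels, prev_box, radius=4, step=2):
    """Motion constraint: only try boxes shifted by at most `radius`
    pixels from the previous box. Brute-force search, used here in
    place of efficient subwindow search."""
    x0, y0, x1, y1 = prev_box
    best_box, best = prev_box, box_score(points, labels, prev_box)
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            cand = (x0 + dx, y0 + dy, x1 + dx, y1 + dy)
            score = box_score(points, labels, cand)
            if score > best:
                best_box, best = cand, score
    return best_box, best
```

In use, `nn_label` would be called on the descriptors detected in the current frame, and `track_step` on their image locations together with the previous frame's box. Only translations of a fixed-size box are searched in this sketch, so changes of scale are not handled.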