Video-based Descriptors for Object Recognition

Taehee Lee, Stefano Soatto
Computer Science Department, University of California, Los Angeles, CA 90095, USA

Abstract

We describe a visual recognition system operating on a hand-held device, based on a video-based feature descriptor, and characterize its invariance and discriminative properties. Feature selection and tracking are performed in real time and used to train a template-based classifier during a capture phase prompted by the user. During normal operation, the system ranks objects in the field of view according to their matching scores. Severe resource constraints have prompted a re-evaluation of existing algorithms, improving their performance (accuracy and robustness) as well as their computational efficiency. We motivate the design choices in the implementation with a characterization of the stability properties of local invariant detectors, and of the conditions under which a template-based descriptor is optimal. The analysis also highlights the role of time as “weak supervisor” during training, which we exploit in our implementation.

Keywords: feature tracking, video-based descriptors, object recognition, mobile devices, visual recognition, active vision

1. Introduction

We tackle the problem of recognizing objects and scenes from images, given example views. The difficulty of this problem lies in the large nuisance variability that the data can exhibit, depending on the vantage point, visibility conditions (occlusions), illumination, etc. under which the object is seen, even if the object itself exhibits no intrinsic variability. The analysis in [1] suggests that the nuisances induce almost all the variability in the data, and what remains (the dependency of the data on the object) is supported on a thin set. The most common approach to this problem is to eliminate some of the nuisances by pre-processing the data (to obtain “distinctive” and yet “insensitive” features), and to “learn away” the residual nuisance variability, often using a training set of manually labeled images. Both practices are poorly grounded in principle: pre-processing does not, in general, improve the performance in a classification task (cf. the data processing inequality [2], whereby no deterministic function of the data can carry more information about the class than the data itself); training a classifier using unrelated images (aiming to approximate independent samples from the class-conditional distribution) disregards the fact that there is a scene out there, and limits the classifier to learning generic regularities in images.

It can be shown that, when a collection of passively gathered independent snapshots is used as a training set, not only is the worst-case error in a visual recognition problem at chance level (i.e., the risk is the same as that offered by the prior), but so is the average-case error [3]. This is not the case, however, when the training data consist of images captured purposefully during an active exploration phase [4].

In this paper we propose a different approach to recognition, grounded in the ideas of Active Vision [5, 6] and Actionable Information [4], whereby the training set consists not of isolated snapshots, such as photo collections harvested from the web, but of temporally coherent sequences of images in which the user is free to move around an object or manipulate it. Even if the objects are static, it can be shown that the availability of video yields quantifiably superior recognition performance on a single (test) image.
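To make this training protocol concrete, the following is a minimal sketch, not the system described in this paper: it assumes OpenCV's corner detector and KLT tracker as stand-ins for the feature selection and tracking stages, harvests a template from every tracked view of a feature (so that temporal continuity, rather than manual labeling, associates the views with one object), and ranks objects in a test image by normalized cross-correlation against the stored templates. All function names and parameter values below are illustrative assumptions.

    import cv2
    import numpy as np

    PATCH = 15  # template half-width; an arbitrary choice for this sketch

    def extract_patch(gray, pt):
        # Cut a (2*PATCH+1)^2 template around a tracked point, if it lies
        # fully inside the (grayscale, uint8) frame.
        x, y = int(round(pt[0])), int(round(pt[1]))
        h, w = gray.shape
        if x < PATCH or y < PATCH or x + PATCH >= w or y + PATCH >= h:
            return None
        return gray[y - PATCH:y + PATCH + 1, x - PATCH:x + PATCH + 1].copy()

    def train_from_video(frames, label, database):
        # Track corners through the capture video with a KLT tracker; every
        # surviving view of a feature becomes a template tagged with the
        # object label.  Temporal continuity is what certifies that all of
        # these views show the same object ("time as weak supervisor").
        pts = cv2.goodFeaturesToTrack(frames[0], maxCorners=200,
                                      qualityLevel=0.01, minDistance=10)
        if pts is None:
            return
        prev = frames[0]
        for frame in frames[1:]:
            if len(pts) == 0:
                break
            nxt, status, _err = cv2.calcOpticalFlowPyrLK(prev, frame, pts, None)
            pts = nxt[status.ravel() == 1].reshape(-1, 1, 2)
            for p in pts:
                patch = extract_patch(frame, p[0])
                if patch is not None:
                    database.append((patch, label))
            prev = frame

    def rank_objects(gray, database):
        # Score each label by the best normalized cross-correlation of any
        # of its templates against the test image, then rank by score.
        scores = {}
        for patch, label in database:
            ncc = cv2.matchTemplate(gray, patch, cv2.TM_CCOEFF_NORMED).max()
            scores[label] = max(scores.get(label, -1.0), ncc)
        return sorted(scores.items(), key=lambda kv: -kv[1])

A real implementation on a hand-held device would also bound the size of the template database and match at multiple scales; the sketch ignores both.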
More importantly, the issue of representation is well grounded in the presence of multiple images of the same scene, and temporal continuity provides the crucial “bit” that the images in the training set portray the same scene, so that all the variability in the data can be ascribed to the nuisances.

Contrary to common perception, building repre-