Tunable Kernels for Tracking

Vasu Parameswaran, Visvanathan Ramesh, Imad Zoghlami
Real-Time Vision and Modeling Department
Siemens Corporate Research
Princeton, NJ 08540

Abstract

We present a tunable representation for tracking that simultaneously encodes appearance and geometry in a manner that enables the use of mean-shift iterations for tracking. The classic formulation of the tracking problem using mean-shift iterations encodes spatial information very loosely (i.e. using radially symmetric kernels). A problem with such a formulation is that it becomes easy for the tracker to get confused with other objects having the same feature distribution but different spatial configurations of features. Subsequent approaches have addressed this issue but not to the degree of generality required for tracking specific classes of objects and motions (e.g. humans walking). In this paper, we formulate the tracking problem in a manner that encodes the spatial configuration of features along with their density and yet retains robustness to spatial deformations and feature density variations. The encoding of spatial configuration is done using a set of kernels whose parameters can be optimized for a given class of objects and motions, off-line. The formulation enables the use of mean-shift iterations and runs in real-time. We demonstrate better tracking results on synthetic and real image sequences as compared to the original mean-shift tracker.

1. Introduction

We are interested in real-time object tracking, which remains a challenging problem and is of particular relevance in today’s emerging application domains such as visual surveillance, driver assistance etc. A crucial component in a solution to tracking is object representation, where a key challenge is to capture the ‘right’ amount of variability of the object. Too much rigidity (e.g. template based approaches) or too much flexibility (e.g.
feature-histogram based approaches) will restrict the environments where a tracker can work reliably. The ‘right’ amount of variability naturally depends on the specific types of motion and the class of object being tracked. In this work, we are interested in the best way to use this type of a priori knowledge for target representation: specifically, how to encode variability, and how to learn this variability automatically.

We focus on the mean-shift tracker, originally proposed in [5]. Key advantages of the tracker include fast operation, robustness and invariance to a large class of object deformations. A large body of work followed [5], exploring various related aspects such as feature spaces (e.g. [2], [11]), encoding of spatial information (e.g. recently [14], [1]), shape adaptation (e.g. [13], [15]) etc. The representation chosen in the original formulation is a weighted feature histogram, where each pixel is weighted by a radially symmetric kernel that depends upon its normalized spatial distance from the object center (i.e. a kernel modulated histogram). Use of a radially symmetric kernel renders the representation invariant to a large set of transformations (any transformation that preserves the distance of a pixel from the center, e.g. rotations). While the weighting scheme may be appropriate if nothing were known a priori about the object or the types of motion that it can undergo, this large amount of invariance poses problems when the object moves close to a region having a similar feature histogram but a very different spatial configuration of features, resulting in multiple peaks for the cost function being maximized, and confusion for the tracker. A second issue is that of bandwidth selection for the spatial modulation. Though a significant amount of work has addressed the issue of bandwidth selection for segmentation problems (e.g. [3], [12]), it is not clear how it could be adapted to encode acceptable deformations of a target.
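To make the invariance of the kernel modulated histogram concrete, the following minimal sketch builds such a histogram for a small patch and for a rotated copy of it. It is an illustration under stated assumptions, not the implementation of [5]: the Epanechnikov-style profile, the bandwidth of half the patch size, the toy patch of pre-quantized feature bin indices, and the names `profile`, `kernel_histogram` and `rotate90` are all hypothetical choices made here.

```python
def profile(r2):
    """Radially symmetric weight; depends only on the squared normalized
    distance r2 from the patch center (Epanechnikov-style, an assumption)."""
    return max(0.0, 1.0 - r2)


def kernel_histogram(patch, n_bins):
    """Kernel modulated histogram of a patch of integer feature-bin indices."""
    height, width = len(patch), len(patch[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0  # object center
    hy, hx = height / 2.0, width / 2.0              # assumed bandwidth
    hist = [0.0] * n_bins
    for y, row in enumerate(patch):
        for x, feature_bin in enumerate(row):
            r2 = ((y - cy) / hy) ** 2 + ((x - cx) / hx) ** 2
            hist[feature_bin] += profile(r2)
    total = sum(hist)
    return [value / total for value in hist]        # normalize to sum to 1


def rotate90(patch):
    """Rotate a patch by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


# Toy 5x5 patch with three feature bins in an asymmetric layout.
patch = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 2, 2, 1, 1],
    [0, 0, 2, 1, 1],
    [0, 0, 0, 1, 1],
]
rotated = rotate90(patch)

h_orig = kernel_histogram(patch, 3)
h_rot = kernel_histogram(rotated, 3)

# The spatial layouts differ, yet the histograms agree up to rounding.
print(rotated != patch)
print(max(abs(a - b) for a, b in zip(h_orig, h_rot)))
```

Because the weight of a pixel depends only on its distance from the center, the rotated patch is indistinguishable from the original in this representation, which is the source of the ambiguity discussed above.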

A number of papers have addressed the issue of encoding spatial information into the representation. In the area of image retrieval, the multiresolution histogram [7] offers implicit encoding of spatial information. In the area of tracking, the following papers describe approaches for incorporating spatial information: Hager et al. [8] analyze the types of motion that the kernel-modulated histogram is invariant to, and propose distributing kernels spatially to capture enough information to recover specific kinds of object motion (e.g. rotation). ‘Color correlograms’ are used in [14] to capture the co-occurrences of pairs of colors separated by specific distances along orthogonal directions. The