Robust Visual Vocabulary Tracking Using Hierarchical Model Fusion

Behzad Bozorgtabar (1), Vision & Sensing, HCC Lab, ESTeM, University of Canberra, Email: Behzad.Bozorgtabar@canberra.edu.au
Roland Goecke (1, 2), (2) IHCC, RSCS, CECS, Australian National University, Email: roland.goecke@ieee.org

Abstract—In this paper, we propose a new visual tracking approach based on the Hierarchical Model Fusion framework, which fuses two different trackers to cope with different tracking problems. We use an Incremental Multiple Principal Component Analysis tracker as our main model and an image patch tracker as our auxiliary model. Firstly, we randomly sample image patches within the target region obtained by the main model in the training frames to construct a visual vocabulary using Histogram of Oriented Gradients features. Secondly, we use a supervised learning algorithm based on a Gaussian Mixture Model, which not only exploits the supervised information to improve the discriminative power of the clusters, but also increases their purity. Then, auxiliary models are initialised by computing confidence scores for image patches based on the similarity between candidates and codewords. In addition, an updating procedure and a result refinement scheme are included in the proposed tracking approach. Experiments on challenging video sequences demonstrate the robustness of the proposed approach in handling occlusion, pose variation and rotation.

I. INTRODUCTION

Although visual tracking has been studied for many years [1]–[3], it remains challenging due to inevitable object appearance variation, rotation, cluttered and dynamic backgrounds, and occlusion. A key component of any tracking system is therefore a good appearance model for the object. This paper presents a hybrid tracker consisting of a holistic appearance model (the main model) and a patch-based model (the auxiliary model).
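The vocabulary construction summarised above (random patch sampling within the target region, HOG description, clustering into codewords) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the simplified orientation histogram (no block normalisation), the patch size, and the use of plain k-means in place of the paper's supervised GMM are all assumptions made for brevity.

```python
import numpy as np

def hog_descriptor(patch, n_bins=9):
    """Orientation histogram of gradients over a grey-scale patch
    (a simplified stand-in for a full block-normalised HOG)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-8)

def sample_patches(region, n_patches=50, size=8, rng=None):
    """Randomly sample square patches inside the tracked target region."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = region.shape
    patches = []
    for _ in range(n_patches):
        y = rng.integers(0, h - size)
        x = rng.integers(0, w - size)
        patches.append(region[y:y + size, x:x + size])
    return patches

def build_vocabulary(descriptors, k=8, n_iter=20, rng=None):
    """Cluster HOG descriptors into k codewords (plain k-means here;
    the paper instead uses a supervised Gaussian Mixture Model)."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.stack(descriptors)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres
```

In a tracking loop, the vocabulary would be built once from the training frames and the resulting codewords reused to score candidate patches in later frames.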
The main idea of this paper is how to combine and maintain the appearance models obtained by the two different methods. We propose a tracking algorithm based on Hierarchical Model Fusion (HMF) [4] to link two object models probabilistically. The parameter update for each model takes place hierarchically. Firstly, we use a three-dimensional tensor based on the HSV colour map for an Incremental Multiple Principal Component Analysis (MPCA) tracker [5]. Then, we utilise the similarities between image patch feature vectors (Histogram of Oriented Gradients (HOG) descriptors) and the related codewords to develop a novel tracker based on the visual vocabulary.

On the one hand, the main model tracker obtained by Incremental MPCA gives precise tracking results when the target is subject to significant variation in scale and illumination. On the other hand, the image patch tracker is more competent in dealing with out-of-plane rotation than the region tracker. That is, when the target experiences occlusion or out-of-plane rotation and the main tracker drifts from the target centre, the probabilistically linked patch tracker assists the main model in locating the target. Online adaptation and result refinement are also employed to further improve the results.

II. RELATED WORK

David et al. [6] present an efficient and effective online algorithm that incrementally learns and adapts a low-dimensional eigenspace representation to reflect appearance changes of the target, thereby facilitating the tracking task. Grabner et al. [7] present a method that both adjusts to variations in appearance during tracking and selects suitable features, so that it can learn any object and discriminate it from the surrounding background. Markis et al. [4] integrate object models by modifying the update equations of Bayesian filters. Each object model uses some of the cues to estimate its parameters and is then used to estimate a subset of the parameters of the resultant models.
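The way the auxiliary patch tracker can support the main model can be sketched as below. The Gaussian kernel on the nearest-codeword distance, the bandwidth sigma, and the linear blending weight alpha are illustrative assumptions, not the paper's actual fusion equations.

```python
import numpy as np

def patch_confidence(desc, codewords, sigma=0.5):
    """Confidence of one candidate patch: a Gaussian kernel on the
    distance to its nearest codeword (sigma is an assumed bandwidth)."""
    d = np.linalg.norm(codewords - desc, axis=1).min()
    return float(np.exp(-d ** 2 / (2 * sigma ** 2)))

def fused_score(main_likelihood, patch_descs, codewords, alpha=0.5):
    """Illustrative fusion: blend the main (Incremental MPCA) model's
    likelihood with the mean auxiliary patch confidence. The paper links
    the two models probabilistically via HMF; a convex combination is
    used here only as a simple stand-in."""
    aux = np.mean([patch_confidence(d, codewords) for d in patch_descs])
    return alpha * main_likelihood + (1 - alpha) * aux
```

With this kind of blending, a candidate whose holistic appearance score has degraded (e.g. under partial occlusion) can still score well if its local patches match the vocabulary.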
Nister et al. [8] use simple text-retrieval systems based on the analogy of ‘visual words’. Images are scanned for salient regions and a high-dimensional descriptor is computed for each region. These descriptors are then clustered into a vocabulary of visual words, and each salient region is mapped to the visual word closest to it under this clustering. Sivic et al. [9] use flat k-means clustering, which is successful but difficult to scale to large vocabularies. Mikolajczyk et al. [10] propose cluster hierarchies and use them to greatly increase the visual vocabulary size. Lian et al. [11] use a Gaussian Mixture Model to take advantage of a soft assignment and try to maximise the discriminative ability of the visual words using image labels; a supervised logistic regression model is used to modify the parameters of the Gaussian mixture. Mairal et al. [12] jointly optimise a single sparse dictionary (using the L1 norm) and a classification model in a mixed generative and discriminative formulation. Moosmann et al. [13] use randomised clustering forests based on supervised learning to build visual dictionaries. Without considering the class labels, these trees are used as simple spatial dividers that assign a distinct region label to each leaf. The drawback of this method is that it ignores the likelihood of the data, which is important in a Bag of Words (BoW) image representation. Fernando et al. [14] present an incremental gradient descent based clustering algorithm, which optimises the visual word creation by using the class labels of training examples.
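The soft assignment exploited by the GMM-based approaches above can be illustrated with a short sketch: each descriptor contributes to every visual word in proportion to the posterior probability of the corresponding mixture component, rather than voting for a single nearest word. Diagonal covariances and the toy parameter values are assumptions for brevity; fitting the mixture (and any supervised refinement of it) is omitted.

```python
import numpy as np

def gmm_soft_assign(x, means, covs, weights):
    """Posterior responsibility of each Gaussian component for a
    descriptor x (diagonal covariances assumed for simplicity)."""
    # log N(x | mu_k, diag(cov_k)) up to constants shared across components
    log_p = -0.5 * (((x - means) ** 2) / covs + np.log(covs)).sum(axis=1)
    log_p += np.log(weights)
    log_p -= log_p.max()          # subtract max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

def soft_bow_histogram(descriptors, means, covs, weights):
    """Soft-assignment Bag-of-Words histogram: each descriptor votes
    for every visual word according to its posterior responsibility."""
    hist = np.zeros(len(weights))
    for x in descriptors:
        hist += gmm_soft_assign(x, means, covs, weights)
    return hist / hist.sum()
```

Compared with the hard assignment of flat k-means, a descriptor that falls between two codewords splits its vote between them, which makes the resulting BoW histogram less sensitive to quantisation boundaries.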