LONG AND SHORT MEMORY BALANCING IN VISUAL CO-TRACKING USING Q-LEARNING

Kourosh Meshgi ⋆ Maryam Sadat Mirzaei ⋆† Shigeyuki Oba ⋆
⋆ Graduate School of Informatics, Kyoto University, Japan
† RIKEN Center for Advanced Intelligence Project (AIP), Japan

ABSTRACT

Employing one or more additional classifiers to break the self-learning loop in tracking-by-detection has gained considerable attention. Most such trackers merely use the redundancy to address the accumulating label error in the tracking loop, and suffer from high computational complexity as well as tracking challenges that may interrupt all classifiers (e.g., temporal occlusions). We propose the active co-tracking framework, in which the main classifier of the tracker labels samples of the video sequence and consults the auxiliary classifier only when it is uncertain. Depending on the source of the uncertainty and the differences between the two classifiers (e.g., accuracy, speed, update frequency), different policies should be adopted to exchange information between them. Here, we introduce a reinforcement learning approach that finds the appropriate policy by considering the state of the tracker in a specific sequence. The proposed method yields promising results in comparison to the best tracking-by-detection approaches.

Index Terms — visual tracking, active learning, Q-learning, mixture-of-memories

1. INTRODUCTION

Tracking-by-detection methods are built around the idea that a single classifier separates the target from its background by labeling (or filtering) several samples drawn from the input image and extrapolating these samples to estimate the current target location and size. This classifier needs to be updated to cope with recent target transformations as well as other challenging factors such as changes in illumination, camera pose, cluttered background, and occlusions.
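The consultation step of the abstract — label with the main classifier, defer to the auxiliary one only near the decision boundary — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the score arrays, the uncertainty threshold `tau`, and the sign-based labeling are all assumed conventions.

```python
import numpy as np

def label_samples(main_scores, aux_scores, tau=0.1):
    """Label candidate samples with the main classifier; consult the
    auxiliary classifier only for samples whose main-classifier score
    falls inside the uncertainty band (-tau, +tau) around the boundary."""
    labels = np.sign(main_scores)           # confident labels from the main classifier
    uncertain = np.abs(main_scores) < tau   # samples too close to the decision boundary
    labels[uncertain] = np.sign(aux_scores[uncertain])  # defer to the auxiliary classifier
    return labels, uncertain

# Hypothetical scores: two confident samples, two ambiguous ones.
main = np.array([0.9, 0.05, -0.8, -0.02])
aux = np.array([1.0, -1.0, -1.0, 1.0])
labels, uncertain = label_samples(main, aux)
```

Only the two uncertain samples (scores 0.05 and -0.02) are relabeled by the auxiliary classifier; the confident ones keep the main classifier's decision, which keeps the auxiliary model's involvement (and its cost) proportional to the tracker's uncertainty.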
The update process is mainly done using the labels that the classifier itself selected for the samples, in a self-supervised learning fashion.

A classifier is not always certain about the output labels. Whether the cause is inefficient features for certain input images, insufficient model complexity to separate some of the samples, a lack of proper training data, missing information in the input data (e.g., due to occlusion), or, technically speaking, an input sample that falls very close to the decision boundary of the classifier, this uncertainty hampers the classifier's ability to be sure about its label and increases the risk of misclassification. Especially in the case of online learning, novel appearances of the target, background distractors, and the non-stationarity of the label distribution 1 increase the uncertainty of the classifier.

Furthermore, the self-supervised learning loop may lead to model drift due to the accumulation of label errors, and many studies have tried to tackle this problem by using robust loss functions for the classifier [1], merging the sampling and learning [2], and employing unlabeled data [3]. One of

Fig. 1. Consider a classifier of a tracking-by-detection method that uses color and shape features and is trained on the video frames leading up to the frame in the left column. When classifying n_s samples from the frame in the middle column, the uncertainty over the samples may follow different trends, as plotted in the uncertainty histograms in the right panels. The histogram may be skewed toward certainty, skewed toward uncertainty (e.g., due to feature failures or occlusion), bimodal (where the background is usually easy to separate but the foreground is ambiguous), etc. In co-tracking frameworks, various patterns of uncertainty require different policies to enhance tracking performance.

This article is based on results obtained from a project commissioned by NEDO and was supported by Post-K application development for exploratory challenges from Japan's MEXT.
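The policy selection that Fig. 1 motivates — matching an information-exchange policy to the observed uncertainty pattern — can be illustrated with a single tabular Q-learning step. This is a hedged sketch only: the discrete state labels (summaries of the uncertainty histogram), the action set (candidate exchange policies), and the reward are hypothetical placeholders, not the paper's actual state, action, or reward design.

```python
from collections import defaultdict

ACTIONS = (0, 1, 2)  # hypothetical exchange policies between the two classifiers

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Here a state could summarize the tracker's uncertainty pattern
    ('certain', 'bimodal', 'uncertain') and an action picks a policy."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q

# Hypothetical transition: a bimodal uncertainty histogram, policy 1 chosen,
# tracking succeeded on this frame (reward 1.0), next pattern 'certain'.
Q = defaultdict(float)
q_update(Q, "bimodal", 1, 1.0, "certain")
```

With all Q-values initialized to zero, this first update moves Q(bimodal, 1) to alpha * reward = 0.1; repeated over a sequence, the table comes to favor whichever exchange policy yields the best tracking reward for each uncertainty pattern.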
1 A sample might be considered foreground, but later its label may become obsolete or the sample may become part of the background.

arXiv:1902.05211v1 [cs.CV] 14 Feb 2019