LONG AND SHORT MEMORY BALANCING IN VISUAL CO-TRACKING USING Q-LEARNING

Kourosh Meshgi ⋆ Maryam Sadat Mirzaei ⋆† Shigeyuki Oba ⋆
⋆ Graduate School of Informatics, Kyoto University, Japan
† RIKEN Center for Advanced Intelligence Project (AIP), Japan

ABSTRACT

Employing one or more additional classifiers to break the self-learning loop in tracking-by-detection has gained considerable attention. Most such trackers merely use the redundancy to address the accumulating label error in the tracking loop, and suffer from high computational complexity as well as tracking challenges that may interrupt all classifiers (e.g., temporal occlusions). We propose the active co-tracking framework, in which the main classifier of the tracker labels samples of the video sequence and consults the auxiliary classifier only when it is uncertain. Depending on the source of the uncertainty and the differences between the two classifiers (e.g., accuracy, speed, update frequency), different policies should be adopted to exchange information between them. Here, we introduce a reinforcement learning approach that finds the appropriate policy by considering the state of the tracker in a specific sequence. The proposed method yields promising results in comparison to the best tracking-by-detection approaches.

Index Terms — visual tracking, active learning, Q-learning, mixture-of-memories

1. INTRODUCTION

Tracking-by-detection methods are built around the idea that a single classifier separates the target from its background by labeling (or filtering) several samples drawn from the input image and extrapolating these samples to estimate the current target location and size. This classifier needs to be updated to cope with recent target transformations as well as other challenging factors such as changes in illumination, camera pose, cluttered background, and occlusions.
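The consultation step of the abstract — label with the main classifier, defer to the auxiliary one only near the decision boundary — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the score arrays, the uncertainty threshold `tau`, and the sign-based labeling are all assumed conventions.

```python
import numpy as np

def label_samples(main_scores, aux_scores, tau=0.1):
    """Label candidate samples with the main classifier; consult the
    auxiliary classifier only for samples whose main-classifier score
    falls inside the uncertainty band (-tau, +tau) around the boundary."""
    labels = np.sign(main_scores)           # confident labels from the main classifier
    uncertain = np.abs(main_scores) < tau   # samples too close to the decision boundary
    labels[uncertain] = np.sign(aux_scores[uncertain])  # defer to the auxiliary classifier
    return labels, uncertain

# Hypothetical scores: two confident samples, two ambiguous ones.
main = np.array([0.9, 0.05, -0.8, -0.02])
aux = np.array([1.0, -1.0, -1.0, 1.0])
labels, uncertain = label_samples(main, aux)
```

Only the two uncertain samples (scores 0.05 and -0.02) are relabeled by the auxiliary classifier; the confident ones keep the main classifier's decision, which keeps the auxiliary model's involvement (and its cost) proportional to the tracker's uncertainty.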
The update process is mainly done using the labels that the classifier itself selected for the samples, in a self-supervised learning fashion.

A classifier is not always certain about the output labels. Whether the cause is inefficient features for certain input images, insufficient model complexity to separate some of the samples, a lack of proper training data, missing information in the input data (e.g., due to occlusion), or, technically speaking, an input sample that falls very close to the decision boundary of the classifier, this uncertainty hampers the classifier's ability to be sure about its label and increases the risk of misclassification. Especially in the case of online learning, novel appearances of the target, background distractors, and the non-stationarity of the label distribution 1 increase the uncertainty of the classifier.

Furthermore, the self-supervised learning loop may lead to model drift due to the accumulation of label errors, and many studies have tried to tackle this problem by using robust loss functions for the classifier [1], merging the sampling and learning [2], and employing unlabeled data [3]. One of

Fig. 1. Consider a classifier of a tracking-by-detection method that uses color and shape features and is trained on the video frames leading up to the frame in the left column. When classifying n_s samples from the frame in the middle column, the uncertainty over the samples may follow different trends, as plotted in the uncertainty histograms in the right panels. The histogram may be skewed toward certainty, skewed toward uncertainty (e.g., due to feature failures or occlusion), bimodal (where the background is usually easy to separate but the foreground is ambiguous), etc. In co-tracking frameworks, various patterns of uncertainty require different policies to enhance tracking performance.

This article is based on results obtained from a project commissioned by NEDO and was supported by Post-K application development for exploratory challenges from Japan's MEXT.
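The policy selection that Fig. 1 motivates — matching an information-exchange policy to the observed uncertainty pattern — can be illustrated with a single tabular Q-learning step. This is a hedged sketch only: the discrete state labels (summaries of the uncertainty histogram), the action set (candidate exchange policies), and the reward are hypothetical placeholders, not the paper's actual state, action, or reward design.

```python
from collections import defaultdict

ACTIONS = (0, 1, 2)  # hypothetical exchange policies between the two classifiers

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Here a state could summarize the tracker's uncertainty pattern
    ('certain', 'bimodal', 'uncertain') and an action picks a policy."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    return Q

# Hypothetical transition: a bimodal uncertainty histogram, policy 1 chosen,
# tracking succeeded on this frame (reward 1.0), next pattern 'certain'.
Q = defaultdict(float)
q_update(Q, "bimodal", 1, 1.0, "certain")
```

With all Q-values initialized to zero, this first update moves Q(bimodal, 1) to alpha * reward = 0.1; repeated over a sequence, the table comes to favor whichever exchange policy yields the best tracking reward for each uncertainty pattern.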
1 A sample might be considered foreground, but later its label may become obsolete or the sample may become part of the background.

arXiv:1902.05211v1 [cs.CV] 14 Feb 2019