A New Analysis of the Value of Unlabeled Data in Semi-Supervised Learning for Image Retrieval

Qi Tian, Jie Yu, Qing Xue, Nicu Sebe*
Department of Computer Science, University of Texas at San Antonio, TX 78249
{qitian, jyu, qxue}@cs.utsa.edu
*Faculty of Science, University of Amsterdam, The Netherlands
nicu@science.uva.nl

Abstract

There has been increasing interest in using unlabeled data in semi-supervised learning for various classification problems. Previous work shows that unlabeled data can improve or degrade classification performance, depending on whether the model assumption matches the ground-truth data distribution and on the complexity of the classifier relative to the size of the labeled training set. In this paper, we provide a new analysis of the value of unlabeled data by considering different distributions of the labeled and unlabeled data and by showing the migrating effect in semi-supervised learning. Extensive experiments have been performed in the context of an image retrieval application. Our approach evaluates the value of unlabeled data from a new perspective and aims to provide a guideline on how unlabeled data should be used.

1. Introduction

Recently, there has been increasing interest in using unlabeled data for classification [1-8]. The motivation comes from the fact that labeled data is typically much harder to obtain than unlabeled data. This holds in many applications, including web search, text classification, genetic research, and machine vision, where an enormous amount of unlabeled data is available at little cost. There are two existing approaches to taking advantage of unlabeled data: semi-supervised learning [1-4] and active learning [5-8]. In semi-supervised learning, one trains a classifier on the labeled data as well as the unlabeled data.
Typically, a coarse classifier is first trained on the smaller labeled data set and is then used to assign probabilistic labels to the unlabeled data. Finally, the enlarged (hybrid) data set, consisting of both the labeled data and the unlabeled data with probabilistic labels, is used to re-train the classifier. In active learning, the coarse classifier is likewise built from the labeled data set, but instead of having all the unlabeled data labeled by the coarse classifier, a set of "most-informative" unlabeled examples is selected. This set is then labeled by a human, as in the relevance-feedback approach of content-based image retrieval (CBIR) [7, 8]. This small set of newly labeled data is believed to greatly enhance the construction of the new classifier. The advantage of active learning is that as little data as possible needs to be labeled to achieve improved performance.

There have been many studies on both active learning [5-8] and semi-supervised learning [1-4]. Past theoretical and experimental work showed that the maximum-likelihood (ML) estimation approach (via EM or other numerical algorithms when unlabeled data is present) improved classification accuracy as more unlabeled data was added [2, 4]. Overall, these publications advance an optimistic view that unlabeled data can be profitably used wherever available. However, [2, 3] also report cases where adding unlabeled data degrades performance, e.g., the Hughes phenomenon in [3]. Recently, Cozman et al. [9] conducted experiments on synthetic data aimed at understanding the value of unlabeled data. They reported that classification accuracy can degrade progressively as more unlabeled data is added, and they identified the cause of this degradation as a mismatch between the model assumption and the ground-truth data distribution.
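The self-training loop described above (train a coarse classifier on the labeled set, assign probabilistic labels to the unlabeled set, and re-train on the hybrid set) can be sketched as follows. This is a minimal illustration under our own assumptions, not the method of any of the cited papers: we stand in a simple soft nearest-centroid classifier for the "coarse classifier," and all function names are hypothetical.

```python
import numpy as np

def fit_centroids(X, soft_labels):
    """Weighted class means; soft_labels is (n, k) with per-class weights."""
    w = soft_labels / soft_labels.sum(axis=0, keepdims=True)
    return w.T @ X  # (k, d) matrix of class centroids

def soft_assign(X, centroids, temp=1.0):
    """Distance-based softmax: probabilistic labels for each point."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def self_train(X_l, y_l, X_u, n_classes=2, n_iter=5):
    hard = np.eye(n_classes)[y_l]           # one-hot labels, labeled set
    centroids = fit_centroids(X_l, hard)    # coarse classifier on labeled data
    for _ in range(n_iter):
        p_u = soft_assign(X_u, centroids)   # probabilistic labels, unlabeled set
        X = np.vstack([X_l, X_u])           # hybrid data set
        P = np.vstack([hard, p_u])
        centroids = fit_centroids(X, P)     # re-train on the enlarged set
    return centroids
```

With well-separated classes, the re-trained centroids move toward the true class means as the unlabeled points contribute their soft labels; under model mismatch (e.g., non-spherical classes), the same loop can pull the centroids away, which is the degradation effect discussed above.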
Considering all these aspects, several questions arise: when does unlabeled data help, and, more importantly, how much does it help in classification, and what are the underlying characteristics of the model that determine the usefulness of the unlabeled data? The conclusion from previous work [1-4, 9] on model assumptions is that the ML estimator is unbiased, and both labeled and unlabeled data contribute to a reduction in classification error by reducing variance, as long as the modeling assumptions match the ground-truth data distribution. If the model assumption does not match the ground-truth data, unlabeled data can improve or degrade the classification performance, depending on the complexity of the classifier relative to the size of the labeled training set [4, 9].
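The variance-reduction argument in the correctly specified case can be made concrete with standard asymptotic ML theory; the notation below is ours, a sketch of the usual argument rather than a result quoted from the cited papers. For $n = n_l + n_u$ samples with labeled fraction $\lambda = n_l / n$, the asymptotic covariance of the ML estimate $\hat{\theta}$ is approximately

$$\mathrm{Cov}(\hat{\theta}) \approx \frac{1}{n}\, I_\lambda^{-1}(\theta^*), \qquad I_\lambda = \lambda\, I_l + (1-\lambda)\, I_u,$$

where $I_l$ and $I_u$ are the Fisher information matrices carried by one labeled and one unlabeled sample, respectively. Since an unlabeled sample carries no more information than a labeled one ($I_u \preceq I_l$), both kinds of data shrink the covariance as $n$ grows, but labeled samples do so faster per example.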