Online Adaptation for Joint Scene and Object Classiﬁcation Jawadul H. Bappy, Sujoy Paul and Amit K. Roy-Chowdhury Dept. of ECE, University of California, Riverside, CA 92521 {mbappy,supaul,amitrc}@ece.ucr.edu Abstract. Recent eﬀorts in computer vision consider joint scene and object classiﬁcation by exploiting mutual relationships (often termed as context) between them to achieve higher accuracy. On the other hand, there is also a lot of interest in online adaptation of recognition models as new data becomes available. In this paper, we address the problem of how models for joint scene and object classiﬁcation can be learned online. A major motivation for this approach is to exploit the hierarchical relationships between scenes and objects, represented as a graphical model, in an active learning framework. To select the samples on the graph, which need to be labeled by a human, we use an information theoretic approach that reduces the joint entropy of scene and object variables. This leads to a signiﬁcant reduction in the amount of manual labeling eﬀort for similar or better performance when compared with a model trained with the full dataset. This is demonstrated through rigorous experimentation on three datasets. Keywords: Scene Classiﬁcation, Object detection, Active learning 1 Introduction Scene classiﬁcation and object detection are two challenging problems in com- puter vision due to high intra-class variance, illumination changes, background clutter and occlusion. Most existing methods assume that data will be labeled and available beforehand in order to train the classiﬁcation models. It becomes infeasible and unrealistic to know all the labels beforehand with the huge corpus of visual data being generated on a daily basis. Moreover, adaptability of the models to the incoming data is crucial too for long-term performance guaran- tees. Currently, the big datasets (e.g. ImageNet [1], SUN [2]) are prepared with intensive human labeling, which is diﬃcult to scale up as more and more new images are generated. So, we want to pose a question, ‘Are all the samples equally important to manually label and learn a model? ’. We address this question in the context of joint scene and object classiﬁcation. Active learning [3] has been widely used to choose a subset of most informative samples that can achieve similar or better performance than all the data being manually labeled. In order to identify the informative samples, most active learn- ing techniques choose the samples about which the classiﬁer is most uncertain.