Weakly Supervised Object Localization with Inter-Intra Regulated CAMs

Ziyi Kou, Guofeng Cui
University of Rochester
zkou2,gcui2@ur.rochester.edu

Shaojie Wang
Washington University in St. Louis
swang115@ur.rochester.edu

Wentian Zhao
Adobe
wzhao14@ur.rochester.edu

Chenliang Xu
University of Rochester
Chenliang.Xu@rochester.edu

Abstract

Weakly supervised object localization (WSOL) aims to locate objects in images by learning only from image-level labels. Current methods obtain localization results from Class Activation Maps (CAMs). They typically propose additional CAMs or feature maps, generated from internal layers of deep networks, and encourage the different CAMs to be either adversarial or cooperative with each other. In this work, instead of following one of these two approaches, we analyze their internal relationship and propose a novel intra-sample strategy that regulates two CAMs of the same sample, generated from different classifiers, so that each of their pixels dynamically participates in either the adversarial or the cooperative process based on its own value. We mathematically demonstrate that our approach is a more general version of the current state-of-the-art method with fewer hyper-parameters. In addition, we develop an inter-sample criterion module for the WSOL task, originally proposed for co-segmentation problems, to refine the generated CAMs of each sample. The module considers a subgroup of samples under the same category and regulates their object regions. With experiments on two widely used datasets, we show that our proposed method significantly outperforms the existing state of the art, setting a new record for weakly supervised object localization.

1. Introduction

Weakly supervised object localization has attracted extensive research effort in recent years [1, 3, 11, 12, 20, 21, 23, 27, 2, 35, 38].
It aims to infer object locations by training only with image-level labels rather than pixel-level annotations, which greatly reduces the cost of human labor in annotating images. The task is challenging since no guidance on the target object's position is provided.

Figure 1: Important components of our method. Given a raw image in (a), the CAMs are generated by a deep convolutional network. The feature map corresponding to the ground-truth label is shown in (b), in which pixels with different values appear in different colors. The binary feature map in (c) marks the high-confidence area in yellow; the segmentation threshold is learned by the network and varies with each sample.

The most popular line of work finds cues in existing classification models. For example, Zhou et al. [45] introduce a Global Average Pooling (GAP) layer to generate Class Activation Maps (CAMs) in top layers, which highlight high-probability positions of target objects. However, CAMs detect only the most discriminative part of an object, which is far from enough to cover the entire object for precise localization.

Therefore, various methods have been proposed to improve the power of CAMs. For example, to enlarge the localized area from CAMs, several adversarial erasing approaches have been proposed [40, 43, 35]. These methods usually build new CAMs on top of the original ones to search for additional valuable areas. To encourage the new CAMs to focus on different regions, the common approach is to erase part of the original image or of an internal feature map by directly manipulating the corresponding values. Despite the appealing idea, these methods add artifacts to image features and cannot guarantee a better re-

arXiv:1911.07160v2 [cs.CV] 19 Nov 2019
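The GAP-based CAM construction of Zhou et al. [45] can be sketched in a few lines: the score of class k under global average pooling equals the spatial average of a per-class activation map obtained by weighting the final feature-map channels with the classifier weights. The sketch below is a minimal NumPy illustration of that identity; the array shapes and function name are our own, not from the paper.

```python
import numpy as np

def class_activation_maps(fmap, weights):
    """Compute CAMs from the last conv feature map and the GAP classifier.

    fmap:    (C, H, W) feature map from the final convolutional layer
    weights: (K, C) weights of the linear classifier applied after GAP
    returns: (K, H, W), one activation map per class, where
             CAM_k(h, w) = sum_c weights[k, c] * fmap[c, h, w]
    """
    return np.tensordot(weights, fmap, axes=([1], [0]))

# Sanity check: the class score computed via GAP equals the spatial
# mean of the corresponding CAM, so the CAM decomposes the score
# into per-position evidence.
fmap = np.random.rand(8, 4, 4)        # 8 channels, 4x4 spatial grid
w = np.random.rand(3, 8)              # 3 classes
cams = class_activation_maps(fmap, w)
scores = w @ fmap.mean(axis=(1, 2))   # GAP then linear classifier
assert np.allclose(scores, cams.mean(axis=(1, 2)))
```

High values in `cams[k]` mark the positions that contribute most to the score of class k, which is why thresholding a CAM yields a (partial) localization of the object.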
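The erasing step these adversarial approaches share can be illustrated concretely: positions where the first CAM is highly activated are zeroed out in the feature map, so a second classifier is forced to rely on complementary regions. The following is a hedged sketch of that mechanism only; the normalization, the fixed threshold of 0.6, and the function name are illustrative choices on our part, not taken from any of the cited methods.

```python
import numpy as np

def erase_high_activation(fmap, cam, threshold=0.6):
    """Zero out feature-map positions where the normalized CAM is high.

    fmap: (C, H, W) internal feature map
    cam:  (H, W) activation map from the first classifier
    Positions with normalized activation >= threshold are erased,
    which is what introduces artifacts into the features.
    """
    cam_norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    keep = cam_norm < threshold        # keep only low-activation positions
    return fmap * keep[None, :, :]     # broadcast the mask over channels

fmap = np.random.rand(8, 4, 4)
cam = fmap.sum(axis=0)
erased = erase_high_activation(fmap, cam)
# the most activated position is erased across all channels
assert np.all(erased[:, cam == cam.max()] == 0)
```

Because the erased features are simply set to zero rather than replaced by anything plausible, the second classifier sees unnatural inputs, which is the artifact problem the paragraph above points out.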