GLIPv2: Unifying Localization and VL Understanding

Haotian Zhang*1, Pengchuan Zhang*2†♠, Xiaowei Hu3, Yen-Chun Chen3, Liunian Harold Li4, Xiyang Dai3, Lijuan Wang3, Lu Yuan3, Jenq-Neng Hwang1, Jianfeng Gao3

1University of Washington, 2Meta AI, 3Microsoft, 4UCLA

{haotiz,hwang}@uw.edu, pengchuanzhang@fb.com, liunian.harold.li@cs.ucla.edu,
{Xiaowei.Hu,Yen-Chun.Chen,Xiyang.Dai,lijuanw,luyuan,jfgao}@microsoft.com

Abstract

We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code is released at https://github.com/microsoft/GLIP.

1 Introduction

Recently, there has been growing interest in building general-purpose vision systems [24, 28, 66, 47], also called vision foundation models [6, 67], that solve various vision tasks simultaneously, such as image classification [35], object detection [44], and Vision-Language (VL) understanding [3, 11, 32].
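To make the region-word-level contrastive task mentioned above concrete, the following is a minimal NumPy sketch of a generic region-word contrastive loss: region and word embeddings are compared by cosine similarity, and each region's similarity distribution over words is pushed toward its grounding targets. The function name, the cosine normalization, and the multi-hot target handling are illustrative assumptions, not GLIPv2's exact formulation.

```python
import numpy as np

def region_word_contrastive_loss(region_feats, word_feats, match):
    """Generic region-word contrastive loss (illustrative sketch).

    region_feats: (R, d) region embeddings
    word_feats:   (W, d) word embeddings
    match:        (R, W) binary matrix; 1 where region i is grounded by word j
    """
    # L2-normalize so the dot product is cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    logits = r @ w.T                                   # (R, W) similarities
    # Softmax over words for each region.
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    # Cross-entropy against (possibly multi-hot) match targets.
    targets = match / match.sum(axis=1, keepdims=True)
    return float(-(targets * np.log(p + 1e-9)).sum(axis=1).mean())
```

With correctly aligned region and word embeddings this loss is lower than with shuffled words, which is the signal that drives the regions and their grounding phrases together in embedding space.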
Of particular interest is the unification of localization tasks (e.g., object detection [44] and segmentation [8, 23]) and VL understanding tasks (e.g., VQA [3] and image captioning [11]). Localization pre-training benefits VL tasks [1, 70], and the "localization→VLP" two-stage pre-training procedure [46, 57, 13, 56, 39, 37, 75, 42, 40] is common practice in the VL community. A long-standing challenge is the unification of localization and understanding, which aims at mutual benefit between these two kinds of tasks, a simplified pre-training procedure, and reduced pre-training cost.

However, the two kinds of tasks appear dramatically different: localization tasks are vision-only and require fine-grained output (e.g., bounding boxes or pixel masks), while VL understanding tasks emphasize fusion between the two modalities and require high-level semantic outputs (e.g., answers or captions). Early attempts [24, 28, 66] unify these tasks in a straightforward multi-task manner, where a low-level visual encoder is shared across tasks and two separate high-level branches are designed for localization and VL understanding, respectively. The localization branch remains vision-only and does not benefit from the rich semantics in vision-language data. As a result, such unified models see marginal mutual benefit, or even performance degradation [28], compared with task-specific models.

*The two authors contributed equally. †Work done at Microsoft Research. ♠Corresponding author.

36th Conference on Neural Information Processing Systems (NeurIPS 2022).

arXiv:2206.05836v2 [cs.CV] 11 Oct 2022