LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Shaikh*, Zhanghexuan Ji*, Dana Moukheiber, Yan Shen, Sargur Srihari, Mingchen Gao
Department of Computer Science and Engineering, University at Buffalo, The State University of New York, Buffalo, NY, USA
{mshaikh2,zhanghex,danamouk,yshen22,srihari,mgao8}@buffalo.edu

* Equal contributions.

Abstract

Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a set of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning. We also propose a new evaluation metric that measures the similarity between the learnt visual and textual embeddings. Experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space. Our code is available at https://github.com/mshaikh2/LaViTeR

1. Introduction

Learning cross-modal visual and textual representations is essential for bridging the semantic gap between images and language. It is the cornerstone for a wide range of vision-language (V+L) tasks, such as image-text cross-modal retrieval, visual question answering (VQA) [2], image captioning [2], and so on.

Inspired by the success of BERT [9] and XLNet [48], which apply self-supervised learning to natural language processing, there has been surging research interest in vision-language pre-training on image-text pairs. The learned task-agnostic representations are shown to be effective for many image-language applications after fine-tuning on specific downstream tasks. Self-supervised learning is designed to exploit the organization of the data as its own source of supervision. This promising approach relieves the burden of annotating data with ground-truth labels and provides an opportunity to explore the large amounts of unlabeled data, such as image-text and video-text pairs, freely available on online platforms. It has also been applied to radiology images combined with their associated reports [24, 5] to leverage the abundance of unlabeled medical data.

[Figure 1 graphic omitted: it depicts the VTA, ITM, and TIM modules operating on the example caption "woman holding child watching giraffe".]
Figure 1: An overview of the end-to-end LAViTeR network. The VTA module is assisted by the ITM and TIM modules, which in turn learn to better align the corresponding visual and textual counterparts. The bidirectional arrows indicate the alignment between words and their respective objects in the given image; the intra-word arrows indicate the relationships between the input words that the network learns.

arXiv:2109.04993v2 [cs.CV] 19 Oct 2021
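The idea of measuring alignment between learnt visual and textual embeddings in a joint feature space can be illustrated with a generic cosine-similarity score over matched image-text pairs. This is a minimal sketch, not the paper's proposed metric; the function name `cosine_alignment` and the toy data are our own illustrative assumptions.

```python
import numpy as np

def cosine_alignment(visual_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired visual and textual embeddings
    projected into a shared feature space (illustrative sketch only)."""
    # L2-normalize each embedding so the dot product equals cosine similarity.
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # Row-wise dot products correspond to the matched image-text pairs.
    return float(np.mean(np.sum(v * t, axis=1)))

# Toy example: 4 image-text pairs with 8-dimensional embeddings, where the
# text embeddings are small perturbations of the visual ones (well aligned).
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
txt = img + 0.1 * rng.standard_normal((4, 8))
score = cosine_alignment(img, txt)
```

A higher mean cosine score indicates that matched visual and textual embeddings point in similar directions in the joint space; perfectly identical pairs score 1.0.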