Published as a conference paper at ICLR 2022

ViTGAN: Training GANs with Vision Transformers

Kwonjoon Lee 1,3   Huiwen Chang 2   Lu Jiang 2   Han Zhang 2   Zhuowen Tu 1   Ce Liu 4
1 UC San Diego   2 Google Research   3 Honda Research Institute   4 Microsoft Azure AI
kwl042@eng.ucsd.edu   {huiwenchang,lujiang,zhanghan}@google.com   ztu@ucsd.edu   ce.liu@microsoft.com

ABSTRACT

Recently, Vision Transformers (ViTs) have shown competitive performance on image recognition while requiring fewer vision-specific inductive biases. In this paper, we investigate whether such performance can be extended to image generation. To this end, we integrate the ViT architecture into generative adversarial networks (GANs). For ViT discriminators, we observe that existing regularization methods for GANs interact poorly with self-attention, causing serious instability during training. To resolve this issue, we introduce several novel regularization techniques for training GANs with ViTs. For ViT generators, we examine architectural choices for the latent and pixel mapping layers to facilitate convergence. Empirically, our approach, named ViTGAN, achieves performance comparable to the leading CNN-based GAN models on three datasets: CIFAR-10, CelebA, and LSUN bedroom. Our code is available online.1

1 INTRODUCTION

Convolutional neural networks (CNNs) (LeCun et al., 1989) dominate computer vision today, thanks to the powerful capabilities of convolution (weight sharing and local connectivity) and pooling (translation equivariance). Recently, however, Transformer architectures (Vaswani et al., 2017) have started to rival CNNs in many vision tasks. In particular, Vision Transformers (ViTs) (Dosovitskiy et al., 2021), which interpret an image as a sequence of tokens (analogous to words in natural language), have been shown to achieve comparable classification accuracy with smaller computational budgets (i.e., fewer FLOPs) on the ImageNet benchmark.
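The patch-tokenization step mentioned above can be made concrete with a minimal NumPy sketch. This is an illustration only, not the paper's implementation; the function name `patchify` and the 4-pixel patch size are choices made here for exposition.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into a sequence of flattened patches,
    the token sequence a ViT consumes (before linear projection and
    positional embeddings)."""
    H, W, C = image.shape
    p = patch_size
    # Carve the image into a (H/p) x (W/p) grid of p x p x C blocks.
    patches = image.reshape(H // p, p, W // p, p, C)
    # Reorder so each block is contiguous, then flatten each block into a token.
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches  # shape: (num_patches, p * p * C)

# A 32x32 RGB image with 4x4 patches yields 64 tokens of dimension 48.
tokens = patchify(np.zeros((32, 32, 3)), 4)
print(tokens.shape)  # (64, 48)
```

In an actual ViT, each flattened patch would then pass through a learned linear projection and receive a positional embedding before entering the Transformer encoder.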
Unlike CNNs, ViTs capture a different inductive bias through self-attention, where each patch attends to all patches of the same image. ViTs, along with their variants (Touvron et al., 2020; Tolstikhin et al., 2021), though still in their infancy, have demonstrated advantages in modeling non-local contextual dependencies (Ranftl et al., 2021; Strudel et al., 2021) as well as promising efficiency and scalability. Since their recent inception, ViTs have been used in various tasks such as object detection (Beal et al., 2020), video recognition (Bertasius et al., 2021; Arnab et al., 2021), and multitask pre-training (Chen et al., 2020a).

In this paper, we examine whether Vision Transformers can perform the task of image generation without using convolution or pooling, and more specifically, whether ViTs can be used to train generative adversarial networks (GANs) with quality comparable to CNN-based GANs. While we can naively train GANs following the design of the standard ViT (Dosovitskiy et al., 2021), we find that GAN training becomes highly unstable when coupled with ViTs, and that adversarial training is frequently hindered by high-variance gradients in the later stage of discriminator training. Furthermore, conventional regularization methods such as gradient penalty (Gulrajani et al., 2017; Mescheder et al., 2018) and spectral normalization (Miyato et al., 2018) cannot resolve the instability, even though they have proven effective for CNN-based GAN models (shown in Fig. 4). As unstable training is uncommon when training CNN-based GANs with appropriate regularization, this presents a unique challenge to the design of ViT-based GANs.

1 https://github.com/mlpc-ucsd/ViTGAN
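To fix ideas about one of the conventional regularizers discussed above, the following is a minimal NumPy sketch of spectral normalization (Miyato et al., 2018): the largest singular value of a weight matrix is estimated by power iteration and divided out, bounding the layer's Lipschitz constant near 1. This illustrates the standard technique only, not the ViT-specific regularization this paper later introduces; the function name and iteration count are choices made here.

```python
import numpy as np

def spectral_normalize(W, n_iters=20):
    """Divide W by an estimate of its largest singular value,
    obtained via power iteration (as in Miyato et al., 2018)."""
    u = np.random.RandomState(0).randn(W.shape[0])
    for _ in range(n_iters):
        # Alternate multiplying by W^T and W to converge on the
        # top singular vectors (v and u respectively).
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = u @ W @ v  # estimate of the largest singular value
    return W / sigma

# A matrix with singular values (3, 1) is rescaled so its top
# singular value becomes ~1.
W_sn = spectral_normalize(np.diag([3.0, 1.0]))
print(np.linalg.svd(W_sn, compute_uv=False)[0])  # ~1.0
```

In practice (e.g., `torch.nn.utils.spectral_norm`), a single power-iteration step is amortized across training updates rather than run to convergence at every forward pass.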