Multi-Variate Temporal GAN for Large Scale Video Generation

Andres Munoz*, Mohammadreza Zolfaghari*, Max Argus, Thomas Brox
University of Freiburg
{amunoz, zolfagha, argusm, brox}@informatik.uni-freiburg.de

Abstract

In this paper, we present a network architecture for video generation that models spatio-temporal consistency without resorting to costly 3D architectures. In particular, we elaborate on the components of noise generation, sequence generation, and frame generation. The architecture facilitates the exchange of information between neighboring time points, which improves the temporal consistency of the generated frames at both the structural and the detail level. The approach achieves state-of-the-art quantitative performance on the UCF-101 dataset, as measured by the inception score, which is in line with a qualitative inspection of the generated videos. We also introduce a new quantitative measure that uses downstream tasks for evaluation.

1. Introduction

Generative Adversarial Networks (GANs) [21] have enabled powerful ways to generate high-resolution images [48, 3]. Video generation adds further complexity: the resulting content must not only make sense spatially but must also be temporally coherent. This is particularly true for motion, which does not exist in still images.

3D Convolutional Neural Network (CNN) architectures appear well suited to trivially lift the progress made on single images to videos [13, 7, 40, 18], yet their usefulness for video generation is still being debated [44, 37]. One argument against 3D CNNs is that the temporal dimension behaves differently from the spatial dimensions. The authors of MoCoGAN [44] showed that equal treatment of space and time results in fixed-length videos, whereas the length of real-world videos varies. Moreover, according to studies in the literature [23, 15], 3D CNNs have more parameters, which makes them more prone to overfitting [19].
We share the view of TGAN [37] and MoCoGAN [44]: instead of mapping a single point in the latent space to a video, a video is assumed to be a smooth sequence of points in a latent space, where each point corresponds to a single frame of the video. As a result, our video generator consists of two submodules: a sequence generator that produces a sequence of points in the latent space, and an image generator that maps these points into image space.

For the image generator, we propose a Temporal Self-Attention Generator, which introduces a temporal shifting mechanism into the residual blocks of the generator. We repurpose the Temporal Shift Module (TSM) proposed by Lin et

* Equal Contribution

Figure 1: Our proposed MVT-TSA method produces both high quality video frames and coherent motion. Frame quality is shown by results for UCF-101 comparing (a) MDP [52], (b) MoCoGAN [44], and (c) TGANv2 [42] to our method (d). Motion quality is shown by a sample of frames from MVT-TSA trained on the Jester [24] dataset (e).

arXiv:2004.01823v1 [cs.CV] 4 Apr 2020
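The two-submodule decomposition can be illustrated with a toy NumPy sketch. Here a small-step random walk stands in for the learned sequence generator, and a fixed linear decoder with a tanh squashing stands in for the learned image generator; the latent dimension, frame size, and step scale are all illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, num_frames = 16, 5

def sequence_generator(z0, num_frames, step_scale=0.1):
    """Toy stand-in for the learned sequence generator: a small-step
    random walk yields a smooth trajectory of latent points."""
    points = [z0]
    for _ in range(num_frames - 1):
        points.append(points[-1] + step_scale * rng.standard_normal(z0.shape))
    return np.stack(points)                       # (T, latent_dim)

# Toy stand-in for the learned image generator: one fixed linear map
# shared across all frames, squashed into pixel range by tanh.
decoder = rng.standard_normal((latent_dim, 8 * 8))

def image_generator(z):
    return np.tanh(z @ decoder).reshape(8, 8)     # one frame per latent point

trajectory = sequence_generator(rng.standard_normal(latent_dim), num_frames)
video = np.stack([image_generator(z) for z in trajectory])   # (T, H, W)
```

Because consecutive latent points are close, consecutive frames decoded by the shared generator change smoothly, which is exactly the property the decomposition is meant to give.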
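The temporal shifting mechanism is easy to state in isolation. A minimal NumPy sketch of the (offline) Temporal Shift Module of Lin et al., with the shift fraction as an illustrative choice: a fraction of channels is shifted one step forward in time, another fraction one step backward, and the rest stay in place, so a block can mix information from neighboring frames at zero extra parameters:

```python
import numpy as np

def temporal_shift(x, shift_div=4):
    """Shift a fraction of channels along the time axis (offline TSM).

    x: array of shape (T, C, H, W). The first C // shift_div channels are
    shifted forward in time, the next C // shift_div backward; vacated
    time steps are zero-filled, remaining channels are left untouched.
    """
    t, c, h, w = x.shape
    fold = c // shift_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                # forward: frame t sees t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]  # backward: frame t sees t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]           # unshifted channels
    return out
```

The operation is a pure memory movement, which is why it can be dropped into residual blocks of a 2D generator to exchange information between neighboring time points without resorting to 3D convolutions.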