Symbolic Music Generation with Transformer-GANs

Aashiq Muhamed 1*, Liang Li 1*, Xingjian Shi 1, Suri Yaddanapudi 1, Wayne Chi 1, Dylan Jackson 1, Rahul Suresh 1, Zachary C. Lipton 2, Alexander J. Smola 1
1 Amazon Web Services, 2 Carnegie Mellon University
{muhaaash, mzliang, xjshi, yaddas, waynchi, jacdylan, surerahu}@amazon.com
zlipton@cmu.edu, alex@smola.org

Abstract

Autoregressive models using Transformers have emerged as the dominant approach for music generation, with the goal of synthesizing minute-long compositions that exhibit large-scale musical structure. These models are commonly trained by minimizing the negative log-likelihood (NLL) of the observed sequence in an autoregressive manner. Unfortunately, the quality of samples from these models tends to degrade significantly for long sequences, a phenomenon attributed to exposure bias. Fortunately, we are able to detect these failures with classifiers trained to distinguish between real and sampled sequences, an observation that motivates our exploration of adversarial losses to complement the NLL objective. We use a pre-trained SpanBERT model for the discriminator of the GAN, which in our experiments helped with training stability. We use the Gumbel-Softmax trick to obtain a differentiable approximation of the sampling process, making discrete sequences amenable to optimization in GANs. In addition, we break the sequences into smaller chunks to ensure that we stay within a given memory budget. We demonstrate via human evaluations and a new discriminative metric that the music generated by our approach outperforms a baseline trained with likelihood maximization, the state-of-the-art Music Transformer, and other GANs used for sequence generation: 57% of listeners prefer music generated via our approach, while 43% prefer Music Transformer.

Introduction

At present, neural sequence models are generally trained to maximize the likelihood of the observed sequences.
This ensures statistical consistency, but it can lead to undesirable artifacts when generating long sequences. While these artifacts are difficult to suppress with maximum likelihood training alone, they are easily detected by most sequence classifiers. We take advantage of this fact, incorporating an adversarial loss derived from GANs. To illustrate its benefits, we demonstrate improvements in the context of symbolic music generation.

* Equal contribution, corresponding authors.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Generative modeling as a field has progressed significantly in recent years, particularly with respect to creative applications such as art and music (Briot, Hadjeres, and Pachet 2017; Carnovalini and Rodà 2020; Anantrasirichai and Bull 2020). A popular application is the generation of symbolic music, a task that presents unique challenges not found in text generation due to polyphony and rhythm. At the same time, generating symbolic music can be simpler than audio generation due to the higher level of abstraction. Many language models from the NLP literature have been applied and extended to music generation. Since we build on this line of work, we use the terms sequence models and language models interchangeably throughout, depending on the context in which a model is mentioned.

Neural models for music sequences convert a digital representation of a musical score into a time-ordered sequence of discrete tokens. Language models are then trained on the event sequences with the objective of maximizing the likelihood of the data. Music can then be generated by sampling or beam-decoding from this model. Recent advances in Natural Language Processing (NLP), especially the attention mechanism and the Transformer architecture (Vaswani et al. 2017), have helped advance the state of the art in symbolic music generation (Huang et al.
2018; Payne 2019; Donahue et al. 2019). Music Transformer (Huang et al. 2018) and MuseNet (Payne 2019) use relative attention and sparse kernels (Child et al. 2019), respectively, to remember long-term structure in the composition. More recent works in music generation (Donahue et al. 2019; Huang and Yang 2020; Wu, Wang, and Lei 2020) adopt the Transformer-XL architecture (Dai et al. 2019), which uses recurrent memory to attend beyond a fixed context.

Despite recent improvements, these approaches exhibit crucial failure modes, which we argue arise from the training objective: Music Transformer (Huang et al. 2018) occasionally forgets to switch off notes and, as its authors note, loses coherence beyond a few target lengths. It sometimes produces highly repetitive songs, sections that are almost empty, and discordant jumps between contrasting phrases and motifs. Consequently, music generated by such models can be distinguished from real music by a simple classifier. This suggests that a distributional distance, such as the discriminative objective of a GAN (Goodfellow et al. 2014), should improve the fidelity of the generative model.

Unfortunately, incorporating GAN losses for discrete sequences can be difficult: computing the derivative of the samples through the discrete sampling process is challenging. As such, many models (de Masson d’Autume et al.
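The differentiability obstacle raised here is what the Gumbel-Softmax trick mentioned in the abstract addresses. The following is a minimal NumPy sketch of that relaxation under stated assumptions — the function name and toy four-token vocabulary are illustrative, and the paper's actual estimator (temperature schedule, straight-through variant, integration with the generator) is not shown:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling from Categorical(softmax(logits)).

    Adding i.i.d. Gumbel(0, 1) noise to the logits and applying a
    temperature-scaled softmax yields a point on the probability simplex
    that approaches a one-hot sample as tau -> 0 (the Gumbel-max trick).
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-12, 1.0, size=np.shape(logits))
    g = -np.log(-np.log(u))                    # Gumbel(0, 1) noise via inverse CDF
    y = (np.asarray(logits) + g) / tau
    y = y - y.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Toy 4-token vocabulary: a "soft" sample the generator could backpropagate
# through, in place of a non-differentiable argmax or multinomial draw.
logits = np.log(np.array([0.1, 0.2, 0.3, 0.4]))
soft = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0))
```

Because every operation above is a smooth function of `logits`, gradients from a discriminator's loss can flow back into the generator, which is what makes adversarial training on discrete token sequences feasible.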