Workshop track - ICLR 2018

CAPTURING MUSICAL STRUCTURE USING CONVOLUTIONAL RECURRENT LATENT VARIABLE MODEL

Eunjeong Stella Koh*, Dustin Wright* & Shlomo Dubnov
University of California, San Diego
* denotes equal contribution

ABSTRACT

In this paper, we present a model for learning musical features and generating novel sequences of music. Our model, the Convolutional-Recurrent Variational Autoencoder (C-RVAE), captures short-term polyphonic sequential musical structure using a Convolutional Neural Network as a front-end. To generate sequential data, we apply a recurrent latent variable model, which uses an encoder-decoder architecture with latent probabilistic connections to capture the hidden structure of music. Using this sequence-to-sequence model, our generative model can draw samples from a prior distribution and generate longer sequences of music.

1 INTRODUCTION

Previous studies (Chen & Miikkulainen 2001; Waite et al.; Boulanger-Lewandowski et al. 2012) on Recurrent Neural Networks (RNNs) for music generation face two major challenges: 1) understanding the higher-level semantics of musical structure, which is critical to music composition, and 2) simple but repetitive patterns in the musical output (Bretan et al. 2016). The musical surface is difficult to represent when a piece involves multiple instruments (polyphony). To address this, many music generation models rely on pre-processing of musical features prior to training. In this study, we handle the difficulty of representing polyphony with a Convolutional Neural Network (CNN) acting on the symbolic representation of the input music. CNNs have been used for music generation in previous work starting from audio-domain data (i.e., wav) (Oord et al. 2016), and recent studies introduce CNNs with symbolic-domain data, leading to innovations in generating music with complex melodies. C-RNN-GAN (Mogren 2016), MidiNet (Yang et al.
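A common symbolic representation for polyphonic music of the kind discussed above is the piano roll: a binary pitch-by-time matrix in which a cell is 1 when a pitch sounds at a time step. The sketch below is illustrative only (the paper does not specify this encoding in this section); the helper `notes_to_pianoroll` and its note-tuple format `(pitch, onset_step, duration_steps)` are assumptions for the example.

```python
import numpy as np

def notes_to_pianoroll(notes, n_pitches=128, n_steps=16):
    """Render (pitch, onset_step, duration_steps) note events as a
    binary pitch-by-time matrix (a piano roll)."""
    roll = np.zeros((n_pitches, n_steps), dtype=np.uint8)
    for pitch, onset, duration in notes:
        roll[pitch, onset:onset + duration] = 1
    return roll

# A C-major triad held for four steps, then a single melody note:
# chords simply become multiple 1s in the same time-step column.
notes = [(60, 0, 4), (64, 0, 4), (67, 0, 4), (72, 4, 2)]
roll = notes_to_pianoroll(notes)
print(roll.shape)       # (128, 16)
print(int(roll.sum()))  # 14 active (pitch, step) cells
```

Because simultaneous notes are just multiple active cells in one column, such a matrix can be fed to a CNN like an image, which is what motivates the convolutional front-end described here.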
2017), and MuseGAN (Dong et al. 2017) developed different models that encourage researchers to use CNNs for capturing musical structure. Several approaches have been studied for: 1) covering multi-channel MIDI with CNN layers, 2) using several generative models for multi-track data generation, and 3) processing longer sequences of data. As an extension of these advances, we explore the Convolutional-Recurrent Variational Autoencoder (C-RVAE), an effective method of learning useful musical features that we use for polyphonic music generation.

In the studies by Hennig et al. (2017) and Roberts et al., the variational autoencoder (VAE) has been shown to be useful for musical creation. In the same vein, the Variational Recurrent Neural Network, introduced by Fabius & van Amersfoort (2014) and Chung et al. (2015), was shown to perform well at generating sequential outputs by integrating latent random variables into a recurrent neural network. For the latent variable structure, the model uses encoded data in latent space at each step, and these studies argue that the recurrent steps allow flexible generation of more diverse styles of music while incorporating concrete features from the data. With these possibilities in mind, we propose that our model can extract musical structure by using the VAE structure in a recurrent network combined with a CNN for learning a representation of symbolic-domain music.

2 METHOD

2.1 FEATURE LEARNING WITH CONVOLUTIONAL NEURAL NETWORK

We adopt a CNN in order to learn a better representation of polyphonic music by treating the input as a 2D binary feature map. This is predicated on the notion that the arrangement of notes in a musical
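The "latent probabilistic connections" in a VAE-style model are usually implemented with the reparameterization trick: the encoder emits a mean and log-variance, and the latent is sampled as z = mu + sigma * eps with eps drawn from a standard normal, so the sampling step stays differentiable. A minimal numpy sketch of this per-timestep sampling follows; the function name `reparameterize` and the toy shapes are assumptions, not the paper's implementation.

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where
    sigma = exp(0.5 * logvar). Randomness lives in eps, so the
    sample is differentiable with respect to mu and logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)

# Toy per-timestep latents: at each of T steps a (hypothetical)
# recurrent encoder would emit a mean and log-variance for z_t.
T, latent_dim = 8, 4
mu = np.zeros((T, latent_dim))
logvar = np.zeros((T, latent_dim))   # zeros -> unit variance
z = reparameterize(mu, logvar, rng)  # one latent vector per timestep
print(z.shape)                       # (8, 4)
```

At generation time, z_t is drawn from the prior N(0, I) instead of the encoder, which is what lets the decoder produce novel sequences from prior samples as described above.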