Deep Generative Models of Music Expectation

Ninon Lizé Masclef
contact@ninonlizemasclef.com

T. Anderson Keller
t.anderson.keller@gmail.com

Abstract

A prominent theory of affective response to music revolves around the concepts of surprisal and expectation. In prior work, this idea has been operationalized in the form of probabilistic models of music, which allow for precise computation of song (or note-by-note) probabilities, conditioned on a ‘training set’ of prior musical or cultural experiences. To date, however, these models have either computed exact probabilities from hand-crafted features or been restricted to linear models, which are likely insufficient to represent the complex conditional distributions present in music. In this work, we propose to use a modern deep probabilistic generative model, in the form of a diffusion model, to compute an approximate likelihood of a musical input sequence. Unlike the models used in prior work, a generative model parameterized by deep neural networks is able to learn complex non-linear features directly from the training set itself. We therefore expect such models to represent the ‘surprisal’ of music for human listeners more accurately. It is known from the literature that there is an inverted U-shaped relationship between surprisal and the amount human subjects ‘like’ a given song. In this work, we show that pre-trained diffusion models indeed yield musical surprisal values which exhibit a negative quadratic relationship with measured subject ‘liking’ ratings, and that the quality of this relationship is competitive with state-of-the-art methods such as IDyOM. We therefore present this model as a preliminary step in developing modern deep generative models of music expectation and subjective likability.

1 Introduction

The fields of psychology and musicology have identified expectation as a crucial factor that conveys meaning in music [14], especially affective response [12, 21, 26].
The malleability of musical experience underscores that listening to music is not merely a passive activity, but an active learning process in which expectations are formed that shape our emotional responses [22]. Indeed, prior work has found that the sweet spot of expectation, which maximizes information learning and reward-system-related responses, is that of intermediate complexity [18]. Wilhelm Wundt [30] proposed an inverted U-curve to describe the relationship between stimulus intensity and pleasant feeling. The curve was only operationalized decades later, however, when Berlyne [4] applied it specifically to aesthetic pleasure. Since then, the Wundt effect has dominated psychological research on music preference for more than two decades [7]. Specifically, Berlyne [3] found that arousal is a predominant factor in aesthetic preference, with three types of variables determining the level of arousal of a stimulus input. One type in particular, the collative variables, which encompass novelty, complexity, and uncertainty, has been shown to contribute most to musical liking [2]. Among these collative variables, predictability has been empirically shown to contribute to music preference [29], although this finding is not unanimous [17]. Furthermore, there is a growing body of research on serendipity as a new evaluation metric for music recommendation systems [31, 23, 5], linking surprisal and pleasure of music.

Preprint. Under review. arXiv:2310.03500v1 [cs.SD] 5 Oct 2023
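As an illustration of the inverted-U (Wundt) relationship between surprisal and liking discussed in the introduction, the following is a minimal sketch of how such a relationship could be tested: fit a second-degree polynomial to (surprisal, liking) pairs and check that the quadratic coefficient is negative. The function name `fit_inverted_u`, the synthetic data, and all numeric values are assumptions made for illustration only; they are not the data, model, or analysis used in this work.

```python
import numpy as np


def fit_inverted_u(surprisal, liking):
    """Least-squares fit of liking = a*s**2 + b*s + c.

    Returns (a, b, c, peak), where peak = -b / (2a) is the surprisal
    value at the vertex of the fitted parabola. A negative `a` is
    consistent with an inverted-U relationship.
    """
    # np.polyfit returns coefficients from highest degree to lowest.
    a, b, c = np.polyfit(surprisal, liking, deg=2)
    peak = -b / (2 * a) if a != 0 else float("nan")
    return a, b, c, peak


# Synthetic, illustrative data only (NOT taken from the paper):
# liking peaks at an intermediate surprisal of 5.0, plus small noise.
rng = np.random.default_rng(0)
surprisal = np.linspace(0.0, 10.0, 50)
liking = -0.2 * (surprisal - 5.0) ** 2 + 6.0 + rng.normal(0.0, 0.1, size=50)

a, b, c, peak = fit_inverted_u(surprisal, liking)
print(f"quadratic coefficient a = {a:.3f}, peak at surprisal = {peak:.2f}")
```

In practice the surprisal values would come from a probabilistic model of music (for example, as a negative log-likelihood) and the liking values from subject ratings; the sign of the fitted quadratic coefficient is only one simple indicator of an inverted-U shape.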