Bayesian Nonparametric Matrix Factorization for Recorded Music

Matthew D. Hoffman mdhoffma@cs.princeton.edu
David M. Blei blei@cs.princeton.edu
Perry R. Cook prc@cs.princeton.edu
Princeton University, Department of Computer Science, 35 Olden St., Princeton, NJ, 08540 USA

Abstract

Recent research in machine learning has focused on breaking audio spectrograms into separate sources of sound using latent variable decompositions. These methods require that the number of sources be specified in advance, which is not always possible. To address this problem, we develop Gamma Process Nonnegative Matrix Factorization (GaP-NMF), a Bayesian nonparametric approach to decomposing spectrograms. The assumptions behind GaP-NMF are based on research in signal processing regarding the expected distributions of spectrogram data, and GaP-NMF automatically discovers the number of latent sources. We derive a mean-field variational inference algorithm and evaluate GaP-NMF on both synthetic data and recorded music.

1. Introduction

Recent research in machine learning has focused on breaking audio spectrograms into separate sources of sound using latent variable decompositions. Such decompositions have been applied to identifying individual instruments and notes, e.g., for music transcription (Smaragdis & Brown, 2003), to predicting hidden or distorted signals (Bansal et al., 2005), and to source separation (Févotte et al., 2009). A problem with these methods is that the number of sources must be specified in advance, or found with expensive techniques such as cross-validation. This problem is particularly relevant when analyzing music. We want the discovered latent components to correspond to real-world sound sources, and we cannot expect the same number of sources to be present in every recording.
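To make the fixed-number-of-sources limitation concrete, here is a minimal sketch of a conventional NMF decomposition; it is not GaP-NMF. It assumes Lee and Seung's multiplicative updates for the generalized KL divergence, a standard technique swapped in for illustration. The helpers `matmul`, `kl_divergence`, and `kl_nmf` and the toy matrix are hypothetical. The number of components K is an input, so choosing it means fitting and comparing several values, which is the kind of expensive search mentioned above.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


def kl_divergence(X, Y, eps=1e-12):
    """Generalized KL divergence D(X || Y) = sum of x*log(x/y) - x + y."""
    total = 0.0
    for x_row, y_row in zip(X, Y):
        for x, y in zip(x_row, y_row):
            if x > 0:
                total += x * math.log(x / (y + eps))
            total += y - x
    return total


def kl_nmf(X, K, n_iter=200, seed=0, eps=1e-12):
    """Factor a nonnegative M x N matrix X as W (M x K) times H (K x N).

    K must be fixed before fitting (hypothetical helper, KL multiplicative
    updates). Returns W, H, and the divergence after each iteration.
    """
    M, N = len(X), len(X[0])
    rng = random.Random(seed)
    W = [[rng.uniform(0.5, 1.5) for _ in range(K)] for _ in range(M)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(N)] for _ in range(K)]
    losses = []
    for _ in range(n_iter):
        # Update H with W held fixed; R holds the ratios X / (W H).
        V = matmul(W, H)
        R = [[X[m][n] / (V[m][n] + eps) for n in range(N)] for m in range(M)]
        w_sums = [sum(W[m][k] for m in range(M)) for k in range(K)]
        for k in range(K):
            for n in range(N):
                num = sum(W[m][k] * R[m][n] for m in range(M))
                H[k][n] *= num / (w_sums[k] + eps)
        # Update W with the new H held fixed.
        V = matmul(W, H)
        R = [[X[m][n] / (V[m][n] + eps) for n in range(N)] for m in range(M)]
        h_sums = [sum(H[k]) for k in range(K)]
        for m in range(M):
            for k in range(K):
                num = sum(H[k][n] * R[m][n] for n in range(N))
                W[m][k] *= num / (h_sums[k] + eps)
        losses.append(kl_divergence(X, matmul(W, H)))
    return W, H, losses


# Hypothetical toy "spectrogram": 4 frequency bins x 4 time windows built
# from two nonnegative spectral templates (two underlying sources).
W_true = [[2.0, 0.1], [0.1, 2.0], [1.0, 0.1], [0.1, 1.0]]
H_true = [[1.0, 0.1, 2.0, 0.1], [0.1, 1.0, 0.1, 2.0]]
X = matmul(W_true, H_true)

for K in (1, 2, 3):
    _, _, losses = kl_nmf(X, K, n_iter=300)
    print(K, round(losses[-1], 4))
```

Each call fixes K before seeing the data. The training divergence can only be compared across separate fits, and because it generally shrinks as K grows, it cannot by itself pick the number of sources.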
Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

In this article, we develop Gamma Process Nonnegative Matrix Factorization (GaP-NMF), a Bayesian nonparametric (BNP) approach to decomposing spectrograms. We posit a generative probabilistic model of spectrogram data where, given an observed audio signal, posterior inference reveals both the latent sources and their number.

The central computational challenge posed by our model is posterior inference. Unlike other BNP factorization methods, our model is not composed of conjugate pairs of distributions: we chose our distributions to be appropriate for spectrogram data, not for computational convenience. We use variational inference to approximate the posterior, and develop a novel variational approach to inference in nonconjugate models.

Variational inference approximates the posterior with a simpler distribution, whose parameters are optimized to be close to the true posterior (Jordan et al., 1999). In mean-field variational inference, each variable is given an independent distribution, usually of the same family as its prior. Where the model is conjugate, optimization proceeds by an elegant coordinate ascent algorithm. Where conjugacy is absent, researchers usually resort to less efficient scalar optimization. We instead use a variational family that is larger than the family of each variable's prior. We show that this gives an analytic coordinate ascent algorithm, of the kind usually limited to conjugate models.

We evaluated GaP-NMF on several problems: extracting the sources from music audio, predicting the signal in missing entries of the spectrogram, and classical measures of Bayesian model fit. Our model performs as well as or better than the current state of the art.
It finds simpler representations of the data with equal statistical power, without needing to explore many fits over many numbers of sources, and thus with much less computation.

2. GaP-NMF Model

We model the Fourier power spectrogram X of an audio signal. The spectrogram X is an M by N matrix of nonnegative reals; the cell $X_{mn}$ is the power of our input audio signal at time window n and frequency bin m. Each column of the power spectrogram is obtained as follows. First, take the discrete Fourier transform of a window of