Non-negative Hidden Markov Modeling of Audio with Application to Source Separation

Gautham J. Mysore 1, Paris Smaragdis 2, and Bhiksha Raj 3

1 Center for Computer Research in Music and Acoustics, Stanford University
2 Advanced Technology Labs, Adobe Systems Inc.
3 School of Computer Science, Carnegie Mellon University

Abstract. In recent years, there has been a great deal of work in modeling audio using non-negative matrix factorization and its probabilistic counterparts, as they yield rich models that are very useful for source separation and automatic music transcription. Given a sound source, these algorithms learn a dictionary of spectral vectors to best explain it. This dictionary is, however, learned in a manner that disregards a very important aspect of sound: its temporal structure. We propose a novel algorithm, the non-negative hidden Markov model (N-HMM), that extends the aforementioned models by jointly learning several small spectral dictionaries as well as a Markov chain that describes the structure of changes between these dictionaries. We also extend this algorithm to the non-negative factorial hidden Markov model (N-FHMM) to model sound mixtures, and demonstrate that it yields superior performance in single-channel source separation tasks.

1 Introduction

A common theme in most good strategies for modeling audio is the ability to make use of structure. Non-negative factorizations such as non-negative matrix factorization (NMF) and probabilistic latent component analysis (PLCA) have been shown to be powerful in representing spectra as a linear combination of vectors from a dictionary [1]. Such models take advantage of the inherent low-rank nature of magnitude spectrograms to provide compact and informative descriptions. Hidden Markov models (HMMs) have instead made use of the inherent temporal structure of audio and have been shown to be particularly powerful in modeling sounds in which temporal structure is important, such as speech [2].
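The dictionary-learning idea behind NMF can be sketched briefly. The following is a minimal illustration, not code from the paper: it factors a toy magnitude spectrogram V into a non-negative spectral dictionary W and per-frame activations H using Lee and Seung's multiplicative updates for the Euclidean cost (audio work, including PLCA, more often uses a KL-divergence cost, but the structure is the same). The matrix sizes and random input are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrogram: 64 frequency bins x 100 time frames (assumed sizes).
V = np.abs(rng.standard_normal((64, 100)))
K = 8  # number of dictionary vectors (assumed)

# Random non-negative initialization of dictionary W and activations H.
W = rng.random((64, K)) + 1e-3
H = rng.random((K, 100)) + 1e-3

eps = 1e-12
err_before = np.linalg.norm(V - W @ H)

# Multiplicative updates for the Euclidean cost ||V - WH||_F^2.
# Because the updates multiply by non-negative ratios, W and H stay non-negative.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

err_after = np.linalg.norm(V - W @ H)

# Each frame V[:, t] is now approximated as a non-negative linear
# combination of the K dictionary columns: W @ H[:, t].
```

These updates are monotone, so the reconstruction error never increases; the learned columns of W play the role of the spectral dictionary described above, while H says how strongly each vector is active in each frame.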
In this work, we propose a new model that combines the rich spectral representative power of non-negative factorizations with the temporal structure modeling of HMMs. In [3], ideas from non-negative factorizations and HMMs were combined by representing sound mixtures as a linear combination of spectral vectors while also modeling the temporal structure of each source. However, at a given time frame, each source is represented by a single spectral vector rather than a linear combination of multiple spectral vectors. As pointed out in that work, this has some virtue for speech, which is monophonic, but it can break down when representing rich polyphonic sources such as music, for which one would resort to standard NMF. In our

This work was performed while interning at Adobe Systems Inc.
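The distinction between the two per-frame representations can be made concrete. The sketch below is an illustration constructed for this discussion, not code from the paper or from [3]: given a fixed spectral dictionary W, it fits one frame v either with a single scaled dictionary vector (as in an HMM-style model with one spectral state per frame) or with a non-negative combination of several dictionary vectors (as in NMF/PLCA). All sizes and the synthetic frame are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed dictionary: 32 frequency bins, 5 spectral vectors.
W = rng.random((32, 5))

# A "polyphonic" frame that genuinely mixes three dictionary vectors.
v = W @ np.array([0.5, 0.0, 0.3, 0.0, 0.2])

# (a) Single-vector model: pick the one scaled column that fits best.
gains = (W.T @ v) / np.sum(W**2, axis=0)  # per-column least-squares gain
err_single = min(
    np.linalg.norm(v - g * W[:, k]) for k, g in enumerate(gains)
)

# (b) Combination model: non-negative least squares via projected
# gradient descent on ||v - W h||^2 with h >= 0.
h = np.full(5, 0.1)
step = 1.0 / np.linalg.norm(W.T @ W, 2)
for _ in range(500):
    h = np.maximum(0.0, h - step * (W.T @ (W @ h - v)))
err_combo = np.linalg.norm(v - W @ h)
```

Because the frame is a genuine mixture, the non-negative combination fits it far better than any single scaled vector can, which is exactly why a single-vector-per-frame model struggles with polyphonic sources.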