Proc. of the 16th Int. Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2-6, 2013

MODELLING AND SEPARATION OF SINGING VOICE BREATHINESS IN POLYPHONIC MIXTURES

Ricard Marxer, Jordi Janer *
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
ricard.marxer@upf.edu, jordi.janer@upf.edu

ABSTRACT

Most current source separation methods target only the voiced component of the singing voice. Besides the unvoiced consonant phonemes, the remaining breathiness is very noticeable to human listeners and retains much of the phonetic and timbral information of the singer. We propose a low-latency method for estimating the spectrum of the breathiness component, which is taken into account when isolating the singing voice source from the mixture. The breathiness component is derived from the harmonic envelope detected in pitched vocal sounds. The separation of the voiced components is used in conjunction with an existing iterative approach based on spectrum factorization. Finally, we conduct an objective evaluation that demonstrates the improvement in separation quality, supported also by a number of audio examples.

1. INTRODUCTION

Breathiness is an aspect of voice quality that is difficult to estimate or analyse due to its stochastic nature and wideband spectral characteristics. In Western music mixture signals this component often overlaps with other wideband components such as drums or transients. To our knowledge, no music source separation method has focused on this component of the singing voice. In the field of speech analysis and synthesis, however, the decomposition and manipulation of the breathiness component has been applied in a variety of areas such as text-to-speech synthesis, speech encoding, and clinical assessment of disordered voices. For example, in [1] the authors study the relations between the vocal tract and the glottal source in human speech signals.
The work in [2] focuses on the analysis of the breathy component of the speech voice. It proposes a modulation-based model in which the noise component of the voice is modulated by the glottal waveform; this model is used to analyse, synthesize, and transform isolated voice recordings. The authors of [3] address the problem of separating the unvoiced components of the singing voice; however, they focus on consonants and propose no specific breathiness model. The authors of [4] propose an extension of the source-filter model that accounts for turbulence at the glottal level and for radiation at the lips and nostrils. Their model, Separation of the Vocal-tract with the Liljencrants-Fant model plus Noise (SVLN), shows benefits in pitch transformation and breathiness control tasks for singing voice synthesis.

All of these works address voice signals in isolation and consider neither the source separation problem nor the analysis of mixture signals.

* This work was supported by the Yamaha Corporation.

2. PROPOSED ESTIMATION METHOD

Our method can be integrated into any source separation approach that approximates the mixture spectrum as the sum of the lead singing voice and accompaniment spectra, V = X_v + X_m. It is suitable for both low-latency and high-latency situations since it requires only a single audio frame.

The estimation of the breathiness component is based on approximating a pitched voice spectrum (with pitch f0) as a filtered composition of two additive components: a glottal excitation X_vh and a wideband component (due to the glottal air flow) X_vr, both filtered by the vocal tract.
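As a minimal sketch of this two-component composition, the snippet below synthesizes a voice magnitude spectrum from a harmonic comb plus a constant wideband term, both shaped by the vocal tract and radiation responses. The helper names, the Gaussian lobe width, and the value of the gain are illustrative choices, not part of the paper's procedure:

```python
import numpy as np

def harmonic_comb(freqs, f0, width_hz=20.0):
    """Unit-magnitude harmonic comb H[w]: narrow lobes at multiples of f0.

    The Gaussian lobe shape and width are illustrative, not from the paper.
    """
    H = np.zeros_like(freqs)
    for l in range(1, int(freqs[-1] // f0) + 1):
        H += np.exp(-0.5 * ((freqs - l * f0) / width_hz) ** 2)
    return np.minimum(H, 1.0)

def voice_magnitude(L, U, S, H, gamma):
    """Two-component voice model: X_v[w] = L[w] U[w] (S[w] H[w] + gamma).

    The pitched excitation S*H and the constant wideband term gamma are
    both filtered by the vocal tract U and the radiation component L.
    """
    return L * U * (S * H + gamma)
```

With flat L, U, and S, the resulting magnitude is simply H + γ: harmonic peaks riding on a constant breathiness floor.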
The magnitude of the voice spectrum can be expressed in the following manner [4] (see Figure 1):

    X_v[ω] = X_vh[ω] + X_vr[ω]                         (1)
           = L[ω] U[ω] S[ω] H[ω] + L[ω] U[ω] γ         (2)
           = L[ω] U[ω] (S[ω] H[ω] + γ)                 (3)

where S[ω]H[ω] is the spectrum of the excitation, S[ω] is the excitation envelope, H[ω] is a harmonic comb of unit magnitude, γU[ω] is the magnitude spectrum of the breathiness, U[ω] is the magnitude of the frequency response of the vocal tract filter, γ is the gain of the breathiness spectrum relative to the pitched component, and L[ω] is the component due to radiation at the lips and nostrils. Here we approximate the wideband component as a constant spectrum filtered by the vocal tract. This is equivalent to modelling the glottal air flow as white noise, which is realistic especially in the mid-range frequency region.

The human voice excitation envelope can be modelled, as proposed in [5], by a linear decay on a decibel/octave scale:

    S[ω] = C · ω^(m / (20 log10 2))                    (4)

where C is a scaling factor, ω is the frequency in Hz, and m is the slope of the excitation envelope in decibels per octave (dB/octave).

In our scenario the voice spectrum X_v is unavailable; only the mixture spectrum V is accessible. Therefore we cannot directly estimate the breathiness spectrum γU[ω] using Equation 1. Instead, we exploit the fact that at the harmonic positions lf0 of the singing voice pitch the vocals spectrum can be considered predominant, V[lf0] ≈ X_v[lf0], for all harmonic indices l > 0. In this work the pitch is estimated using the method presented in [6]. If we additionally consider the vocal tract filter to be smooth in frequency, as is done in previous works [7], we can interpolate between the harmonic positions to estimate the harmonic envelope e_h[ω] = L[ω]U[ω]S[ω], as done in [6]. By assuming the magnitudes (in the decibel scale) of L[ω]U[ω] to be drawn from a
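The harmonic-envelope interpolation and the excitation model of Equation 4 can be sketched as follows. The helper names, the default slope m, the gain γ, and the simple division-based inversion are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def harmonic_envelope(mag, freqs, f0):
    """Estimate e_h[w] = L[w]U[w]S[w] by dB interpolation at harmonics l*f0."""
    lmax = max(1, int(freqs[-1] // f0))
    idx = np.array([np.argmin(np.abs(freqs - l * f0)) for l in range(1, lmax + 1)])
    peaks_db = 20.0 * np.log10(mag[idx] + 1e-12)      # magnitudes at harmonic bins
    env_db = np.interp(freqs, freqs[idx], peaks_db)   # interpolate between harmonics
    return 10.0 ** (env_db / 20.0)

def excitation_envelope(freqs, m=-12.0, C=1.0):
    """Eq. (4): S[w] = C * w**(m / (20 log10 2)), i.e. a decay of m dB/octave."""
    w = np.maximum(freqs, 1.0)  # avoid a singularity at DC
    return C * w ** (m / (20.0 * np.log10(2.0)))

def breathiness_estimate(mag, freqs, f0, gamma=0.05, m=-12.0):
    """gamma * L[w]U[w] ~= gamma * e_h[w] / S[w] (illustrative inversion)."""
    env = harmonic_envelope(mag, freqs, f0)
    return gamma * env / excitation_envelope(freqs, m=m)
```

A quick sanity check on Equation 4: doubling the frequency changes S[ω] by exactly m dB, which is what "m dB per octave" requires.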