Data-driven voice source waveform analysis and synthesis

Jon Gudnason a,*, Mark R.P. Thomas b, Daniel P.W. Ellis c, Patrick A. Naylor b

a School of Science and Engineering, Reykjavik University, Iceland
b Electrical and Electronic Engineering Department, Imperial College London, London SW7 2AZ, UK
c LabROSA, Columbia University, New York, NY 10027, USA

Received 10 August 2010; received in revised form 11 August 2011; accepted 12 August 2011

Abstract

A data-driven approach is introduced for studying, analyzing and processing the voice source signal. Existing approaches parameterize the voice source signal using models motivated, for example, by physical modeling or function-fitting. Such parameterization is often difficult to achieve and produces a poor approximation to the large variety of real voice source waveforms of the human voice. This paper presents a novel data-driven approach for analyzing different types of voice source waveforms using principal component analysis and Gaussian mixture modeling. This approach models certain voice source features that many other approaches fail to capture. Prototype voice source waveforms are obtained from each mixture component and analyzed with respect to speaker, phone and pitch. An analysis/synthesis scheme was set up to demonstrate the effectiveness of the method. Compressing the proposed voice source representation by discarding 75% of the features yields a segmental signal-to-reconstruction error ratio of 13 dB and a Bark spectral distortion of 0.14.

© 2011 Elsevier B.V. All rights reserved.

Keywords: Voice source signal; Inverse filtering; Vocal tract modeling; Principal component analysis; Gaussian mixture model; Segmental signal-to-reconstruction ratio

1. Introduction

Voice analysis is central to many areas of speech science and technology.
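To make the abstract's pipeline concrete, the following sketch applies principal component analysis to a set of frames, keeps only 25% of the components (discarding 75% of the features, as in the compression experiment), and measures the segmental signal-to-reconstruction error ratio. The synthetic frames, frame length and noise level are invented stand-ins for illustration only, not the authors' corpus or implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for pitch-synchronous voice source frames
# (each row: one glottal-flow-derivative cycle, 64 samples).
# Real frames would come from inverse-filtered speech.
n_frames, frame_len = 500, 64
t = np.linspace(0, 1, frame_len)
base = -np.exp(-40 * (t - 0.6) ** 2)        # crude open-phase pulse shape
frames = base + 0.05 * rng.standard_normal((n_frames, frame_len))

# PCA via SVD of the mean-removed data matrix.
mean = frames.mean(axis=0)
X = frames - mean
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only 25% of the principal components.
k = frame_len // 4
coeffs = X @ Vt[:k].T                        # per-frame feature vectors
recon = coeffs @ Vt[:k] + mean               # reconstructed waveforms

# Segmental signal-to-reconstruction error ratio, averaged over frames.
err = frames - recon
snr_seg = np.mean(10 * np.log10(
    np.sum(frames ** 2, axis=1) / np.maximum(np.sum(err ** 2, axis=1), 1e-12)))
print(f"kept {k}/{frame_len} components, segmental SNR = {snr_seg:.1f} dB")
```

The per-frame coefficient vectors `coeffs` are what a Gaussian mixture model would be fitted to in the analysis stage; each mixture mean, projected back through the retained components, yields a prototype voice source waveform.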
The linear source-filter model of speech has propelled greater understanding of voice production (Fant, 1960; Flanagan, 1972; Strube, 1974; Wong et al., 1979) and given rise to advances in speech technology such as coding (Makhoul, 1975; Spanias, 1994), synthesis (Atal and Hanauer, 1971; Moulines and Charpentier, 1990; Kumar and Gersho, 1997; Cataldo et al., 2006) and voice transformation (Childers et al., 1995; Stylianou et al., 1998).

Fig. 1 shows a 35 ms speech segment and an estimate of the glottal flow derivative, which is referred to as the voice source waveform. There are several ways of obtaining this estimate, but the task is essentially one of blind channel identification and is traditionally implemented by inverse filtering (Alku, 1992). Inverse-filtered speech reveals several important features of the voice source waveform, as depicted in Fig. 1. These include a discontinuity at the instant of glottal closure, a closed phase during which the glottal flow is assumed to be zero, and an open-phase pulse. The voice source waveform during the closed phase can be noisy due to aspiration and/or under-modeling of the vocal tract. The signal produced by inverse filtering can also contain other artifacts due to poor modeling of the vocal tract. Voice source waveform modeling therefore depends on the optimality of the chosen inverse filtering method. The proposed voice source techniques model both features and artifacts and can therefore be used to improve suboptimal inverse-filtering methods.

Voice source modeling has proven to be very useful for understanding the voice production process and has aided the development of new speech processing technology. Many reports, however, only give vague or anecdotal

* Corresponding author.
E-mail addresses: jg@ru.is (J. Gudnason), mark.r.thomas02@imperial.ac.uk (M.R.P. Thomas), dpwe@ee.columbia.edu (D.P.W. Ellis), p.naylor@imperial.ac.uk (P.A. Naylor).

Please cite this article in press as: Gudnason, J., et al. Data-driven voice source waveform analysis and synthesis, Speech Comm. (2011), doi:10.1016/j.specom.2011.08.003
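The inverse filtering step discussed above can be sketched with standard linear prediction: fit an all-pole vocal tract model to the speech and apply its inverse to recover a pulse-like source estimate. The impulse-train excitation, the two invented resonances, and both helper functions below are illustrative assumptions, not the authors' method (which addresses exactly the artifacts this naive approach leaves behind):

```python
import numpy as np

def allpole(a, x):
    """Filter x through 1/A(z), where A(z) = a[0] + a[1]z^-1 + ... (a[0] = 1)."""
    y = np.zeros_like(x)
    for i in range(len(x)):
        acc = x[i]
        for k in range(1, len(a)):
            if i - k >= 0:
                acc -= a[k] * y[i - k]
        y[i] = acc
    return y

def lpc(x, order):
    """Autocorrelation-method LPC: returns A(z) coefficients [1, a1, ..., a_order]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:order + 1])
    return np.concatenate(([1.0], a))

fs, f0, n = 8000, 100, 2000
excitation = np.zeros(n)
excitation[::fs // f0] = 1.0          # impulse train as a crude glottal source

# Two formant-like resonances standing in for the vocal tract filter.
a_true = np.array([1.0])
for freq, bw in [(700, 100), (1200, 150)]:
    r = np.exp(-np.pi * bw / fs)
    sec = np.array([1.0, -2 * r * np.cos(2 * np.pi * freq / fs), r * r])
    a_true = np.convolve(a_true, sec)

speech = allpole(a_true, excitation)

# Inverse filter: estimate A(z) from the speech alone, apply it as an FIR filter.
a_hat = lpc(speech, order=len(a_true) + 1)
residual = np.convolve(speech, a_hat)[:n]  # estimated voice source signal
```

On real speech the residual would show the closed-phase noise and modeling artifacts described above, which is precisely what motivates modeling the waveform data-driven rather than by a fixed parametric pulse shape.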