DECOMPOSITION OF SPEECH SIGNALS FOR ANALYSIS OF APERIODIC COMPONENTS OF EXCITATION

B. Yegnanarayana, M. Anand Joseph & Suryakanth V. G.
Language Technologies Research Center
International Institute of Information Technology
Gachibowli, Hyderabad, India
yegna@iiit.ac.in, anandjm@research.iiit.ac.in, svg@iiit.ac

Dhananjaya N.
Department of Comp. Sci. & Engg.
Indian Institute of Technology Madras
Chennai, India
dhanu@cse.iitm.ac.in

ABSTRACT

The motivation for this study is the need for careful analysis of the aperiodicity of the excitation component in expressive voices. The paper proposes analysis methods that preserve the excitation information corresponding to a sequence of impulse-like excitations with variable strengths. To analyze the details of the excitation source characteristics, the epochs and the strength of the excitation at the epochs are obtained using the output of an ideal zero-frequency digital resonator. The vocal tract system characteristics are derived from the signal between two successive epochs using the numerator of the group delay function. The spectrogram of the zero-frequency filtered signal and the group delay spectrum correspond to the characteristics of the excitation and the vocal tract system, respectively. Decomposition of the speech signal into these two components brings out the features of the excitation and the vocal tract system, which can be used to explain the perception of expressive voices in terms of aperiodicity, pitch, harmonics and sub-harmonics. The decomposition method is illustrated using examples from linguistically significant glottalized sounds (glottal stops and ejectives), singing voices and Noh voice.

Index Terms: Epochs, group delay spectra, aperiodicity, sub-harmonics, glottalized sounds, singing voice, Noh voice

1. INTRODUCTION

Speech signals are produced by exciting the time-varying vocal tract system with a time-varying excitation.
Source-filter theory is normally assumed for analysis of the characteristics of the vocal tract system and the excitation source components. The operation of the source-filter combination is assumed to be linear when extracting the component information from the speech signal. However, speech signals are generated by the strongly nonlinear physiological system of human speech production. In particular, the vibration of the vocal folds at the glottis, which is shaped by the nonlinearity of the vocal fold tissues, is the primary mode of excitation of the vocal tract system. The resulting excitation component is assumed to be quasi-periodic. The chaotic components due to air turbulence, especially near the glottal closure, are assumed to be additive in the linear acoustic system model. The aperiodicity of the vocal fold vibration is generally assumed to be a small perturbation or deviation from the quasi-periodicity assumption.

But in actual speech production, the nonlinearity and the aperiodicity of voiced signals not only convey linguistic information in certain sounds [1, 2], but also indicate the special quality of the voice source in certain singing voices [3, 4]. The extremely expressive voice quality in special types of artistic voices, as in Noh (a traditional performing art of Japan) [3] and in singing [4], demonstrates the significance of the sophistication of voice signals, which are perceived and appreciated by human listeners, but whose quality is extremely difficult to express in quantitative terms. This expressive voice quality also conveys the emotional message of the performer.

Therefore what is needed is to understand the significance of the various components of expressive speech, especially the voiced excitation produced at the glottis, and to extract the component information from speech signals in order to derive measurable parameters that quantify such voices.
The expressive components of speech are mostly due to aperiodicity and turbulence of the voice signal generated by the vibrating vocal folds at the glottis, resulting from the subtle control of the organs involved in speech production by a trained artist. Due to the extreme nature of the variations of the vocal mechanism, these aperiodic and turbulent components cannot be treated merely as deviations from quasi-periodicity and as an additive random component in the linear model.

One clue for the analysis of such signals is to assume that the vocal tract system is excited by a sequence of impulse-like excitations occurring at irregular intervals and with non-uniform strengths. Each of these exciting impulses produces a response of both the vibrating system at the glottis and the dynamic vocal tract system, including the nasal tract. It may also be assumed that the perception of pitch, harmonic and sub-harmonic components of voice signals in speech could be due to the sequence of impulse-like excitations occurring at regular or irregular intervals, with non-uniform strengths. For example, it is obvious that a sinusoidal excitation (as with an artificial larynx) or a random noise excitation (as in breathy voice) cannot produce the perception of harmonics and sub-harmonics of the fundamental (i.e., pitch). Note that the information about the production of the impulse-like excitation sequence can be described well only in the time domain, and not through transform domain parameters such as harmonics, spectral amplitudes, etc., as the latter description requires processing a block of the speech signal. The choice of the size of the block is somewhat arbitrary. Moreover, such block processing is likely to smear the perceptually vital timing information in the sequence of impulses.
It is also likely that block processing combines the effects of aperiodicity due to irregular intervals of pulses with those due to non-reproducibility of the waveform between successive intervals, besides the random noise component due to turbulence. Therefore it is preferable to extract and represent this source information in the time domain itself as far as possible. Then it may be easier to explain the perception of harmonic, sub-harmonic and breathy characteristics of the excitation source.

978-1-4577-0539-7/11/$26.00 ©2011 IEEE. ICASSP 2011
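
The link between non-uniform impulse strengths and sub-harmonics discussed in the introduction can be illustrated with a small numerical sketch. It is not the analysis method of the paper: it only builds two synthetic impulse trains, one with uniform strengths and one with alternating strengths, and compares their spectra at half the pitch frequency. The sampling rate, impulse spacing, strength values and the helper names `impulse_train` and `magnitude_at` are all assumptions chosen for illustration.

```python
import numpy as np

FS = 8000          # assumed sampling rate in Hz (illustrative, not from the paper)
PERIOD = 80        # assumed impulse spacing in samples (10 ms, i.e. 100 Hz pitch)
N_SAMPLES = 1600   # 0.2 s, which holds exactly 20 impulses


def impulse_train(strengths):
    """Place an impulse every PERIOD samples, cycling through `strengths`."""
    x = np.zeros(N_SAMPLES)
    locations = np.arange(0, N_SAMPLES, PERIOD)
    x[locations] = np.resize(strengths, locations.size)
    return x


def magnitude_at(x, freq_hz):
    """DFT magnitude of x at the bin nearest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(x))
    return spectrum[int(round(freq_hz * x.size / FS))]


uniform = impulse_train([1.0])           # every impulse has the same strength
alternating = impulse_train([1.0, 0.5])  # strong and weak impulses alternate

# Time-domain description: impulse locations, intervals and strengths
# are read directly from the sequence, without choosing a block size.
epochs = np.flatnonzero(alternating)
intervals = np.diff(epochs)
strengths = alternating[epochs]

# Transform-domain view: energy at half the pitch frequency (50 Hz)
# appears only when the strengths are non-uniform.
sub_uniform = magnitude_at(uniform, 50.0)          # close to 0
sub_alternating = magnitude_at(alternating, 50.0)  # 10 * (1.0 - 0.5) = 5
fundamental = magnitude_at(alternating, 100.0)     # 10 * (1.0 + 0.5) = 15

print("intervals (samples):", intervals[:4], "strengths:", strengths[:4])
print("50 Hz magnitude, uniform vs alternating:", sub_uniform, sub_alternating)
```

In this sketch the intervals are perfectly regular, so the component at 50 Hz comes from the alternation of strengths alone. The time-domain pair (`epochs`, `strengths`) describes that alternation exactly, whereas the spectral description depends on the length of the analysis block.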