DECOMPOSITION OF SPEECH SIGNALS FOR ANALYSIS OF APERIODIC COMPONENTS
OF EXCITATION
B. Yegnanarayana, M. Anand Joseph & Suryakanth V. G.
Language Technologies Research Center
International Institute of Information Technology
Gachibowli, Hyderabad, India
yegna@iiit.ac.in, anandjm@research.iiit.ac.in, svg@iiit.ac
Dhananjaya N.
Department of Comp. Sci. & Engg.
Indian Institute of Technology Madras
Chennai, India
dhanu@cse.iitm.ac.in
ABSTRACT
The motivation for this study is the need for careful analysis of aperi-
odicity of the excitation component in expressive voices. The paper
proposes analysis methods which can preserve the excitation information
corresponding to a sequence of impulse-like excitations with
variable strengths. To analyze the details of the excitation source
characteristics, the epochs and the strength of the excitation at the
epochs are obtained using the output of an ideal zero-frequency dig-
ital resonator. The vocal tract system characteristics are derived from
the signal between two successive epochs using the numerator of the
group delay function. The spectrogram of the zero-frequency filtered
signal and the group delay spectrum correspond to characteristics of
the excitation and the vocal tract system, respectively. Decomposition
of the speech signal into these two components brings out the
features of excitation and vocal tract system, which can be used to
explain the perception of expressive voices in terms of features of
aperiodicity, pitch, harmonics and sub-harmonics. The decomposi-
tion method is illustrated using examples from linguistically signif-
icant glottalized sounds (glottal stops and ejectives), singing voices
and Noh voice.
Index Terms— Epochs, group delay spectra, aperiodicity, sub-
harmonics, glottalized sounds, singing voice, Noh voice
1. INTRODUCTION
Speech signals are produced by exciting the time-varying vocal tract
system with a time-varying excitation. Source-filter theory is normally
assumed for analysis of the characteristics of the vocal tract
system and the excitation source components. The operation of the
source-filter combination is assumed to be linear in extracting the
component information from the speech signal. However, speech
signals are generated by the strongly nonlinear physiological system of
human speech production. In particular, the vibration of the vocal
folds at the glottis, the primary mode of excitation of the vocal tract
system, involves the nonlinearity of the vocal fold tissues. The
resulting excitation component is assumed to be quasi-periodic. The
chaotic components due to air turbulence, especially near the glot-
tal closure, are assumed to be additive in the linear acoustic sys-
tem model. The aperiodicity of the vocal fold vibration is gener-
ally assumed to be a small perturbation or deviation from the quasi-
periodicity assumptions.
But in actual speech production, the nonlinearity and the aperiodicity
of voiced signals not only convey linguistic information in
certain sounds [1, 2], but also indicate the special quality of the
voice source in certain singing voices [3, 4]. The extremely expressive
voice quality in special types of artistic voices, as in Noh
(a traditional performing art of Japan) [3] and in singing [4],
demonstrates the significance of the sophistication of voice signals, whose
quality is perceived and appreciated by human listeners but is extremely
difficult to express in quantitative terms. This expressive
voice quality also conveys the emotional message of the performer.
Therefore, what is needed is to understand the significance of the
various components of expressive speech, especially the voiced excitation
produced at the glottis, and to extract the component information
from speech signals to derive measurable parameters that quantify
such voices. The expressive components of speech are mostly due
to aperiodicity and turbulence of the voice signal generated by the
vibrating vocal folds at the glottis, resulting from subtle control of the
organs involved in speech production by a trained artist. Due to the extreme
nature of the variations of the vocal mechanism, these aperiodic and
turbulent components cannot be treated merely as deviations from
quasi-periodicity and as an additive random component in the linear
model.
One clue for analysis of such signals is to assume that the vocal
tract system is excited by a sequence of impulse-like excitations oc-
curring at irregular intervals, and with non-uniform strengths. Each
of these exciting impulses produces a response of both the vibrating
system at the glottis as well as the dynamic vocal tract system in-
cluding the nasal tract. It may also be assumed that the perception
of pitch, harmonic and sub-harmonic components of voice signals in
speech could be due to the sequence of impulse-like excitations occurring
at regular or irregular intervals, with non-uniform strengths.
For example, a sinusoidal excitation (as with an artificial larynx) or a
random noise excitation (as in breathy voice) cannot produce the perception
of harmonics and sub-harmonics of the fundamental (i.e., pitch).
Note that information about the production of the impulse-like excitation
sequence can be described well only in the time domain,
and not through transform domain parameters such as harmonics,
spectral amplitudes, etc., as the latter description requires process-
ing a block of speech signal. The choice of the size of the block
is somewhat arbitrary. Moreover, such block processing is likely to
smear the perceptually vital timing information contained
in the sequence of impulses. It is also likely that the block process-
ing may combine the effects of aperiodicity due to irregular intervals
of pulses and that due to non-reproducibility of the waveform be-
tween successive intervals, besides the random noise component due
to turbulence. Therefore it is preferable to extract and represent this
source information in the time domain itself as far as possible. Then
it may be easier to explain the perception of harmonic, sub-harmonic
and breathy characteristics of the excitation source effectively.
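The link between an impulse-like excitation sequence and the sub-harmonic components discussed above can be illustrated with a small numerical sketch. It is not the analysis method of this paper. It only builds a synthetic impulse sequence and inspects its spectrum. The function names `impulse_train` and `magnitude_at`, the 8 kHz sampling rate, the 80-sample interval (a 100 Hz fundamental) and the strengths 1.0 and 0.6 are all assumptions chosen for illustration.

```python
import numpy as np


def impulse_train(num_samples, period, strengths):
    """Impulses every `period` samples; strengths cycle through `strengths`.

    Hypothetical helper for illustration only.
    Returns the signal and the impulse locations (sample indices).
    """
    x = np.zeros(num_samples)
    locations = np.arange(0, num_samples, period)
    # np.resize repeats the given strengths until every impulse has one.
    x[locations] = np.resize(strengths, len(locations))
    return x, locations


def magnitude_at(x, fs, freq_hz):
    """Magnitude of the DFT bin closest to `freq_hz` (hypothetical helper)."""
    spectrum = np.abs(np.fft.rfft(x))
    bin_index = int(round(freq_hz * len(x) / fs))
    return spectrum[bin_index]


fs = 8000        # assumed sampling rate in Hz
period = 80      # assumed interval in samples, i.e. f0 = 100 Hz
n = fs           # one second of signal, so DFT bins are 1 Hz apart

uniform, _ = impulse_train(n, period, [1.0])
alternating, locations = impulse_train(n, period, [1.0, 0.6])

for name, x in (("uniform", uniform), ("alternating", alternating)):
    sub = magnitude_at(x, fs, 50.0)     # sub-harmonic at f0 / 2
    fund = magnitude_at(x, fs, 100.0)   # fundamental at f0
    print(f"{name}: |X(50 Hz)| = {sub:.1f}, |X(100 Hz)| = {fund:.1f}")

# In the time domain the description is direct: intervals and strengths.
print("intervals:", np.unique(np.diff(locations)))
print("strengths:", np.unique(alternating[locations]))
```

With equal strengths the component at 50 Hz should vanish, whereas alternating strengths at the same regular intervals should give a 50 Hz component about a quarter the size of the fundamental. The intervals themselves remain 80 samples throughout, so in this sketch the sub-harmonic arises from the strength pattern alone, which the time-domain description (locations and strengths) states directly.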
5396 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011