Int J Speech Technol (2011) 14: 19–33
DOI 10.1007/s10772-010-9086-9
Application of prosody models for developing speech systems
in Indian languages
K. Sreenivasa Rao
Received: 30 September 2010 / Accepted: 2 December 2010 / Published online: 11 December 2010
© Springer Science+Business Media, LLC 2010
Abstract In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used for developing the neural network models that predict duration and intonation. Features representing positional, contextual and phonological constraints are used for developing the prosody models. The use of the prosody models is illustrated with speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set, Weighted Linear Prediction Cepstral Coefficients (WLPCCs).
Keywords Prosody · Duration · Intonation · Feedforward neural network · Speech systems
K.S. Rao (✉)
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India
e-mail: ksrao@iitkgp.ac.in

1 Introduction

Speech can be viewed as the output of a time-varying vocal tract system excited by a time-varying excitation source signal. During the production of speech, human beings seem to impose durational constraints and intonation patterns on top of the vocal tract response of the sequence of sound units to convey the intended message (Stevens 1999; Benesty et al. 2008). For human beings, knowledge of prosody (duration, intonation and loudness patterns) is naturally acquired, and it is difficult to articulate this knowledge.
Acoustic analysis and synthesis experiments have shown that duration and intonation patterns are the two most important prosodic features responsible for the quality of synthesized speech (Huang et al. 2001). Prosodic patterns carry characteristics not only of the speech message and the language, but also of the speaker. Even in speech recognition, human beings seem to rely on prosodic cues to disambiguate errors in the perceived sounds (Lee et al. 2001; Benesty et al. 2008; Werner and Keller 1994). Thus, the acquisition and incorporation of prosodic knowledge is important for developing speech systems. In this paper we focus on models that capture prosodic knowledge from speech; the acquired knowledge is then exploited for developing speech systems in Indian languages.
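As a rough illustration of the kind of prosody model discussed here, the sketch below trains a small feedforward network to map a syllable-level feature vector to a duration value. This is not the paper's implementation: the network size, learning schedule, and the synthetic feature/duration data are all invented stand-ins, meant only to show the regression setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_duration_model(X, y, hidden=16, lr=0.1, epochs=1000):
    """Train a one-hidden-layer feedforward network (tanh hidden units,
    linear output) to predict a normalized duration from feature vectors."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)          # hidden-layer activations
        pred = h @ W2 + b2                # linear output: predicted duration
        err = pred - y[:, None]
        losses.append(float(np.mean(err ** 2)))
        # backpropagation (gradients up to a constant factor)
        dW2 = h.T @ err / n; db2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)  # tanh derivative
        dW1 = X.T @ dh / n; db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses

# Synthetic stand-in data: 25 hypothetical positional/contextual/phonological
# features per syllable, and durations around 150 ms.
X = rng.normal(size=(200, 25))
y = 150 + 30 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 5, 200)
y_norm = (y - y.mean()) / y.std()  # normalize target for stable training
model, losses = train_duration_model(X, y_norm)
```

In practice the input features would be the positional, contextual and phonological factors described above, and the output would be the syllable duration (or, for an intonation model, the fundamental frequency).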
The vocal tract characteristics of different sound units can be modeled appropriately using spectral analysis. Hence, the associated parameters, such as linear prediction coefficients (LPCs) and cepstral coefficients (CCs), are used to represent the vocal tract response of the sound units (Rabiner and Juang 1993). The prosodic characteristics, such as the duration and intonation patterns of a sequence of sound units, are difficult for a machine to handle automatically, yet human beings mostly exploit the prosodic characteristics of speech when performing different speech tasks. It is known that prosodic characteristics are robust to different types of degradation, whereas spectral characteristics are sensitive to them (Yin et al. 2006; Werner and Keller 1994; Benesty et al. 2008). Figures 1 and 2 show the speech signal and its spectrogram in clean and noisy environments.
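To make the spectral representation concrete, the sketch below estimates LPCs by the standard autocorrelation method (Levinson-Durbin recursion) and converts them to cepstral coefficients with the usual LPC-to-cepstrum recursion. This is a minimal illustration, not the paper's feature-extraction pipeline; the test signal is a synthetic second-order autoregressive process with known coefficients.

```python
import numpy as np

def lpc(x, order):
    """Return a[1..p] of the error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p,
    estimated by the autocorrelation method with Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[:i + 1] = a[:i + 1] + k * a[i::-1]
        err *= 1.0 - k * k                # updated prediction error
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients from LPCs via the standard recursion:
    c[n] = -a[n] - sum_{k=1}^{n-1} (k/n) c[k] a[n-k]."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = -acc
    return c[1:]

# Demo: AR(2) signal s[n] = 0.9 s[n-1] - 0.5 s[n-2] + e[n], whose true error
# filter is A(z) = 1 - 0.9 z^-1 + 0.5 z^-2, i.e. a = [-0.9, 0.5].
rng = np.random.default_rng(0)
e = rng.normal(size=4000)
s = np.zeros(4000)
for n in range(2, len(s)):
    s[n] = 0.9 * s[n - 1] - 0.5 * s[n - 2] + e[n]
a = lpc(s, order=2)
c = lpc_to_cepstrum(a, n_ceps=12)
```

A real front end would apply this per windowed frame of speech; weighting the resulting cepstral coefficients (e.g., by a lifter) yields the weighted LPCC features referred to in the abstract.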