Int J Speech Technol (2011) 14: 19–33
DOI 10.1007/s10772-010-9086-9
Application of prosody models for developing speech systems
in Indian languages
K. Sreenivasa Rao
Received: 30 September 2010 / Accepted: 2 December 2010 / Published online: 11 December 2010
© Springer Science+Business Media, LLC 2010
Abstract In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used for developing the neural network models that predict duration and intonation. Features representing positional, contextual and phonological constraints are used for developing the prosody models. The use of the prosody models is illustrated with speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set, Weighted Linear Prediction Cepstral Coefficients (WLPCCs).
Keywords Prosody · Duration · Intonation · Feedforward neural network · Speech systems
K.S. Rao (✉)
School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur 721302, West Bengal, India
e-mail: ksrao@iitkgp.ac.in

1 Introduction

Speech can be viewed as the output of a time-varying vocal tract system excited by a time-varying excitation source signal. During the production of speech, human beings seem to impose durational constraints and intonation patterns on top of the vocal tract response of the sequence of sound units to convey the intended message (Stevens 1999; Benesty et al. 2008). For human beings, knowledge of prosody (duration, intonation and loudness patterns) is naturally acquired, and it is difficult to articulate this knowledge.
Acoustic analysis and synthesis experiments have shown that duration and intonation patterns are the two most important prosodic features responsible for the quality of synthesized speech (Huang et al. 2001). Prosodic patterns carry characteristics not only of the speech message and the language, but also of the speaker. Even in speech recognition, human beings seem to rely on prosodic cues to disambiguate errors in the perceived sounds (Lee et al. 2001; Benesty et al. 2008; Werner and Keller 1994). Thus, the acquisition and incorporation of prosodic knowledge is important for developing speech systems. In this paper we focus on models that capture prosodic knowledge from speech; the acquired knowledge is then exploited for developing speech systems in Indian languages.
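As a rough illustration of the kind of prosody model discussed here, the sketch below trains a small feedforward network to map a syllable-level feature vector to a duration value. This is not the paper's implementation: the network size, learning schedule, and the synthetic feature/duration data are all invented stand-ins, meant only to show the regression setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_duration_model(X, y, hidden=16, lr=0.1, epochs=1000):
    """Train a one-hidden-layer feedforward network (tanh hidden units,
    linear output) to predict a normalized duration from feature vectors."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.1, (hidden, 1)); b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)          # hidden-layer activations
        pred = h @ W2 + b2                # linear output: predicted duration
        err = pred - y[:, None]
        losses.append(float(np.mean(err ** 2)))
        # backpropagation (gradients up to a constant factor)
        dW2 = h.T @ err / n; db2 = err.mean(0)
        dh = (err @ W2.T) * (1 - h ** 2)  # tanh derivative
        dW1 = X.T @ dh / n; db1 = dh.mean(0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses

# Synthetic stand-in data: 25 hypothetical positional/contextual/phonological
# features per syllable, and durations around 150 ms.
X = rng.normal(size=(200, 25))
y = 150 + 30 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 5, 200)
y_norm = (y - y.mean()) / y.std()  # normalize target for stable training
model, losses = train_duration_model(X, y_norm)
```

In practice the input features would be the positional, contextual and phonological factors described above, and the output would be the syllable duration (or, for an intonation model, the fundamental frequency).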
The vocal tract characteristics of different sound units can be modeled appropriately using spectral analysis. Hence, the associated parameters, such as linear prediction coefficients (LPCs) and cepstral coefficients (CCs), are used to represent the vocal tract response of the sound units (Rabiner and Juang 1993). The prosodic characteristics, such as the duration and intonation patterns of a sequence of sound units, are difficult for a machine to handle automatically, yet human beings mostly exploit the prosodic characteristics of speech when performing different speech tasks. It is known that prosodic characteristics are robust to different types of degradation, whereas spectral characteristics are sensitive to them (Yin et al. 2006; Werner and Keller 1994; Benesty et al. 2008). Figures 1 and 2 show the speech signal and its spectrogram in clean and noisy environments.
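To make the spectral representation concrete, the sketch below estimates LPCs by the standard autocorrelation method (Levinson-Durbin recursion) and converts them to cepstral coefficients with the usual LPC-to-cepstrum recursion. This is a minimal illustration, not the paper's feature-extraction pipeline; the test signal is a synthetic second-order autoregressive process with known coefficients.

```python
import numpy as np

def lpc(x, order):
    """Return a[1..p] of the error filter A(z) = 1 + a1 z^-1 + ... + ap z^-p,
    estimated by the autocorrelation method with Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                    # reflection coefficient
        a[:i + 1] = a[:i + 1] + k * a[i::-1]
        err *= 1.0 - k * k                # updated prediction error
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """Cepstral coefficients from LPCs via the standard recursion:
    c[n] = -a[n] - sum_{k=1}^{n-1} (k/n) c[k] a[n-k]."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = -acc
    return c[1:]

# Demo: AR(2) signal s[n] = 0.9 s[n-1] - 0.5 s[n-2] + e[n], whose true error
# filter is A(z) = 1 - 0.9 z^-1 + 0.5 z^-2, i.e. a = [-0.9, 0.5].
rng = np.random.default_rng(0)
e = rng.normal(size=4000)
s = np.zeros(4000)
for n in range(2, len(s)):
    s[n] = 0.9 * s[n - 1] - 0.5 * s[n - 2] + e[n]
a = lpc(s, order=2)
c = lpc_to_cepstrum(a, n_ceps=12)
```

A real front end would apply this per windowed frame of speech; weighting the resulting cepstral coefficients (e.g., by a lifter) yields the weighted LPCC features referred to in the abstract.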