HMM-Based Speech Synthesis & Recognition for Malay and Arabic Languages

Duration: 1st April 2009 - 30th April 2012

Principal Investigator: Raja Noor Ainon Zabariah Bt Raja Zainal Abidin

Group Members:
Dr Roziati Zainuddin
Dr Zuraidah Mohd Don
Dr Othman O. Khalifa
Mumtaz Begum Peer Mustafa
Mohammad Abu Shariah
Noraini Mohamad

Department of Software Engineering, Faculty of Computer Science & Information Technology,
University of Malaya, 50603 Kuala Lumpur

ACKNOWLEDGEMENT

This project is funded by the University of Malaya under the University of Malaya Research Grant Scheme (UMRG) (RG019-09ICT). Total amount funded: RM 116,000.00.

RESEARCH BACKGROUND

A text-to-speech (TTS) synthesizer allows a computer to convert written text input into voice output, while a speech recognition system converts input speech into text output. This research comprises two complementary sub-projects as follows:

I HMM-based Malay Emotional Synthesized Speech

In a typical speech synthesizer, prosody information affects the pitch contours and durations of the sounds generated in response to text input. To a greater or lesser degree, all current synthesis techniques sound unnatural unless prosody information is added. Prosody refers to the rhythmic and intonational aspects of spoken language, comprising a combination of pitch, duration and intensity. In natural speech, prosodic features can be manipulated to express different emotions [1,2]. There are many approaches to generating the required prosody in synthesized emotional speech, such as corpus-based, rule-based and template-based methods [3,4,5,6]. However, the rule-based approach is the most commonly used, as it is the most computationally efficient, even though its output may sound unnatural.
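The rule-based prosody manipulation described above can be illustrated with a minimal sketch. This is not the project's code: the emotion rules, scale factors and function names below are hypothetical, and each emotion is reduced to simple multiplicative rules applied to a neutral pitch contour, phone durations and intensities.

```python
# Hypothetical rule set: (pitch scale, duration scale, intensity scale).
# Real rule-based systems derive such factors from analysed emotional speech.
EMOTION_RULES = {
    "neutral": (1.0, 1.0, 1.0),
    "angry":   (1.3, 0.8, 1.4),   # higher pitch, faster tempo, louder
    "sad":     (0.85, 1.25, 0.7), # lower pitch, slower tempo, softer
}

def apply_emotion(pitch_hz, durations_ms, intensities, emotion):
    """Scale a neutral prosody specification according to an emotion rule.

    Each argument holds one value per phone of the utterance.
    """
    p, d, i = EMOTION_RULES[emotion]
    return (
        [f * p for f in pitch_hz],
        [t * d for t in durations_ms],
        [a * i for a in intensities],
    )

# Neutral prosody for a short utterance (hypothetical values, one per phone).
pitch = [120.0, 135.0, 110.0]
dur = [90.0, 120.0, 150.0]
amp = [1.0, 1.1, 0.9]

angry_pitch, angry_dur, angry_amp = apply_emotion(pitch, dur, amp, "angry")
print(angry_pitch)  # the whole contour is raised by the pitch scale factor
```

The template-based side of the hybrid approach would replace these uniform scale factors with stored contour templates, which is what increases intonation variability.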
In [7] we presented a hybrid technique to enhance the quality of the rule-based approach to prosody generation for Malay speech synthesis, integrating prosody parametric manipulation (a form of rule-based approach) with template parametric manipulation so as to increase the intonation variability of the synthesized output. The system generates reasonably recognizable synthesized speech with the selected vocal emotions, while not requiring large databases or corpora and remaining computationally efficient. However, the work required to derive the parameters for additional rules and templates for the different emotional styles becomes prohibitive for more accurate prosody modeling. To overcome this problem, recent work employs synthesis based on Hidden Markov Models (HMMs). Proposed by Tokuda and Masuko in 1995 [8], HMM-based synthesis makes use of powerful statistical modeling to synthesize high-quality speech. HMMs have been successfully applied to modeling sequences of speech spectra in speech recognition systems, and the performance of HMM-based speech recognition systems has been improved by techniques that exploit the flexibility of HMMs: context-dependent modeling, dynamic feature parameters, mixtures of Gaussian densities, tying mechanisms, and speaker and environment adaptation techniques. Most systems that use HMMs for constructing TTS are based on waveform concatenation techniques. In the proposed approach, on the contrary, speech parameter sequences are generated from the HMMs themselves.
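The idea of generating speech parameters directly from HMMs can be sketched in miniature. This is a simplification, not the HTS implementation: without dynamic (delta) features, the maximum-likelihood output for a fixed state sequence is simply each state's Gaussian mean repeated for its duration, so the sketch reduces to expanding state means. All model values and names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class State:
    mean: float      # Gaussian emission mean (e.g., one spectral or F0 parameter)
    variance: float  # emission variance (only matters once delta features are added)
    duration: int    # number of frames assigned to this state

def generate_parameters(states):
    """Return the maximum-likelihood parameter trajectory for the
    state sequence implied by the per-state durations (no delta features)."""
    trajectory = []
    for s in states:
        trajectory.extend([s.mean] * s.duration)
    return trajectory

# A 3-state left-to-right phone model (hypothetical values).
phone = [State(1.2, 0.1, 2), State(2.5, 0.2, 3), State(1.8, 0.1, 2)]
print(generate_parameters(phone))
```

In a full HMM-based synthesizer, constraints from the dynamic feature statistics smooth this piecewise-constant trajectory into a continuously varying one, which is what makes the generated speech sound natural.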