Text Normalization and Diphone Preparation for Bangla Speech Synthesis

Muhammad Masud Rashid 1, Md. Akter Hussain 2, M. Shahidur Rahman 3
Shahjalal University of Science and Technology, Sylhet 3114, Bangladesh
1 masudcoder@yahoo.com, 2 akter.1985@yahoo.com, 3 rahmanms@sust.edu

Abstract–This paper presents the methodologies involved in text normalization and diphone preparation for Bangla text-to-speech (TTS) synthesis. A concatenation-based TTS system basically comprises two modules: natural language processing and digital signal processing (DSP). Natural language processing converts text to its pronounceable form, a step called text normalization, and then selects diphones based on the normalized text, a step called grapheme-to-phoneme (G2P) conversion. The text normalization issues addressed in this paper include tokenization, conjuncts, null modified characters, numerical words, abbreviations and acronyms. The issues related to diphone preparation include diphone categorization, corpus preparation, diphone labeling and diphone selection. Appropriate rules and algorithms are proposed to tackle all of these issues. We developed a speech synthesizer for Bangla using a diphone-based concatenative approach, which is demonstrated to produce natural-sounding synthetic speech.

Index Terms–Text normalization, diphone, grapheme-to-phoneme, speech synthesis, sentence analysis.

I. INTRODUCTION

A text-to-speech synthesizer is now an important part of information technology because it integrates language and speech for human-computer interaction. The creation of a synthetic voice from text is usually referred to by the general term text-to-speech, although it requires a wide range of procedures. Voice technology applications have created a growing demand for multilingual, multi-voice, multi-style speech synthesis systems.
There are many techniques available for speech synthesis, such as formant synthesis, concatenative synthesis and articulatory synthesis [1, 2]. Formant synthesis uses parameters such as fundamental frequency, voicing and noise levels, instead of human speech samples, to create a synthetic speech waveform, whereas concatenative synthesis uses segments of recorded human speech. Concatenative synthesis has subtypes such as unit selection and diphone synthesis, each with its own advantages and weaknesses. Unit selection stores speech units such as phones, half-phones, diphones and words, and indexes them. At runtime, the best chain of units is determined by the selection algorithm. It requires a large database to store the units, and because the optimal search and/or selection algorithms are not 100% reliable, the quality of the synthesized speech varies from high to low. Diphone synthesis uses a minimal speech store containing all diphones of a language (a diphone is two adjacent half-phones, cut in the middle and joined into one unit), applies little DSP and uses an easy-to-implement selection algorithm.

A great deal of work has already been done on TTS for many European languages [1, 2, 3]. For the Bangla language, however, speech synthesis has yet to reach the level required for direct large-scale applications. To the best of our knowledge, two complete systems have been reported. C-DAC, Kolkata has developed a Bangla TTS system named Bangla Vaani [4]. Very recently, CRBLP of BRAC University released another Bangla TTS, Katha [5], which is built under the Festival framework [6, 7] using unit selection. In an attempt to synthesize speech from Bangla text, Seddiqui et al. reported a normalization process in [3]. The most recent work on text normalization can be found in [6], where the authors identified the semiotic classes and wrote a set of rules for tokenization.
In addition to this special set of words, this paper describes the tokenization of null modified vowels (consonants embedded with the inherent vowel), which is an important and indeed challenging task for a TTS system. We propose rules and techniques to accomplish the normalization task. Diphone concatenation needs less memory, but its sample collection and labeling procedures are more difficult. The procedures for diphone preparation and diphone labeling are therefore also discussed in this paper. When the normalized text is processed with the proposed diphone-based synthesis method, the system is found to produce intelligible and natural-sounding speech. As shown in the simplified block diagram of a TTS system in Fig. 1, the contribution of this paper concerns the first and second stages.

Figure 1: Block diagram of a Diphone based TTS. [Stages, from input to output: Text; Text Analysis / Text Normalization; Grapheme to Phoneme / Diphone Selection; Concatenative / Waveform Synthesis; Speech Output]

JOURNAL OF MULTIMEDIA, VOL. 5, NO. 6, DECEMBER 2010 551 © 2010 ACADEMY PUBLISHER doi:10.4304/jmm.5.6.551-559
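To make the diphone definition given in the Introduction concrete, the following minimal Python sketch shows how a phone sequence could be turned into the list of diphone units that the second stage of Fig. 1 would look up. It is only an illustration under stated assumptions and is not the selection procedure proposed in this paper: the function name `phones_to_diphones`, the silence symbol `_`, the `left-right` naming scheme and the romanized phone labels are all hypothetical.

```python
from typing import List

# Hypothetical symbol for the silence assumed at utterance boundaries.
SILENCE = "_"


def phones_to_diphones(phones: List[str]) -> List[str]:
    """Pair each phone with its successor, padding both ends with silence.

    Each resulting name "left-right" stands for one diphone unit, i.e. the
    stretch from the middle of `left` to the middle of `right`.
    """
    padded = [SILENCE] + list(phones) + [SILENCE]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Hypothetical romanized phone labels, used only for illustration.
    print(phones_to_diphones(["a", "m", "i"]))
    # prints ['_-a', 'a-m', 'm-i', 'i-_']
```

Under these assumptions, an utterance of n phones yields n + 1 diphone names, because the boundary silences contribute one extra transition at each end.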