Text Normalization and Diphone Preparation
for Bangla Speech Synthesis
Muhammad Masud Rashid¹, Md. Akter Hussain², M. Shahidur Rahman³
Shahjalal University of Science and Technology, Sylhet 3114, Bangladesh
¹masudcoder@yahoo.com, ²akter.1985@yahoo.com, ³rahmanms@sust.edu
Abstract–This paper presents the methodologies involved in text
normalization and diphone preparation for Bangla Text to
Speech (TTS) synthesis. A concatenation-based TTS system
basically comprises two modules: natural language
processing and Digital Signal Processing (DSP).
Natural language processing converts text to its
pronounceable form, called Text Normalization, and
selects diphones based on the normalized text,
called Grapheme to Phoneme (G2P) conversion. The text
normalization issues addressed in this paper include
tokenization, conjuncts, null modified characters, numerical
words, abbreviations and acronyms. The issues related to
diphone preparation include diphone categorization, corpus
preparation, diphone labeling and diphone selection.
Appropriate rules and algorithms are proposed to tackle all
of these issues. We developed a speech
synthesizer for Bangla using a diphone-based concatenative
approach, which is demonstrated to produce very
natural-sounding synthetic speech.
Index Terms–Text normalization, diphone, grapheme-to-phoneme,
speech synthesis, sentence analysis.
I. INTRODUCTION
A text-to-speech synthesizer is now an important part of
information technology because it integrates
language and speech for human-computer interaction.
The creation of a synthetic voice from text is usually referred
to by the general term text-to-speech, though it requires a
wide variety of procedures. Voice technology
applications have created a growing demand for
multilingual, multi-voice, multi-style speech synthesis systems.
Many techniques are available for speech synthesis,
such as formant synthesis, concatenative synthesis and articulatory
synthesis [1, 2]. Formant synthesis uses parameters such as fundamental
frequency, voicing and noise levels, instead of human speech
samples, to create a synthetic speech waveform, whereas
concatenative synthesis uses segments of recorded human
speech. Concatenative synthesis has subtypes such as unit
selection and diphone synthesis, each with its own
advantages and weaknesses. Unit selection stores speech
units such as phones, half-phones, diphones and words, and indexes
them. At runtime, the best chain of units is determined by
the selection algorithm. It requires a large database to
store the units and, as the optimal search and/or selection
algorithms used are not 100% reliable, it produces both high- and
low-quality synthesis. Diphone synthesis uses a
minimal speech store that contains all the diphones (two
adjacent half-phones, cut in the middle and joined into one
unit) of a language, applies little DSP and uses an
easy-to-implement selection algorithm. A great deal of work has already
been done on TTS for many European languages [1, 2,
3]. However, for the Bangla language, speech synthesis is
yet to attain the level required for direct large-scale applications.
To our knowledge, two complete systems have been
reported. C-DAC, Kolkata has developed a Bangla TTS
system named Bangla Vaani [4]. Very recently, CRBLP
of BRAC University has released another Bangla TTS,
Katha [5], which is built under the Festival framework [6,
7] using unit selection. In an attempt to synthesize speech
from Bangla text, Seddiqui et al. reported a normalization
process in [3]. The most recent work on text
normalization can be found in [6], where the authors
identified the semiotic classes and wrote a set of
rules for tokenization. In addition to this special set of
words, this paper describes the tokenization of null modified
vowels (consonants embedded with the inherent vowel),
which is an important and indeed challenging
task for a TTS. We propose rules and techniques to
accomplish the normalization task. With diphone
concatenation, less memory is needed, but the sample
collection and labeling procedures are more difficult. The
procedures of diphone preparation and diphone labeling
are also discussed in this paper. When the normalized text
is processed with the proposed diphone-based synthesis
method, the system is found to produce intelligible and
very natural-sounding speech. As seen in the simplified
block diagram of a TTS system in Fig. 1, the contribution
of this paper concerns the first and second stages.
Figure 1: Block diagram of a diphone-based TTS.
Stages, from input to output: Text → Text Analysis/Text Normalization → Grapheme to Phoneme/Diphone Selection → Concatenative/Waveform Synthesis → Speech Output.
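
The way diphone synthesis covers an utterance with units that each span two adjacent half-phones can be illustrated with a minimal Python sketch. It is not the system described in this paper: the function name phonemes_to_diphones, the silence symbol "_", the "left-right" label format and the romanized phone symbols in the usage line are all assumptions made for illustration only.

```python
from typing import List

SILENCE = "_"  # assumed symbol for the pause at utterance boundaries


def phonemes_to_diphones(phonemes: List[str], silence: str = SILENCE) -> List[str]:
    """Return the diphone labels needed to cover a phoneme sequence.

    Each label names the transition from one phone to the next, so a
    sequence of n phones padded with silence yields n + 1 diphones.
    """
    if not phonemes:
        return []
    # Pad with silence so the first and last phones get boundary diphones.
    padded = [silence] + list(phonemes) + [silence]
    # Pair every phone with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


if __name__ == "__main__":
    # Hypothetical romanized phones; real input would come from G2P conversion.
    print(phonemes_to_diphones(["a", "m", "i"]))
    # prints ['_-a', 'a-m', 'm-i', 'i-_']
```

Because every label depends only on a pair of neighbouring phones, the inventory grows at most with the square of the number of phones, which is why a diphone database stays small compared with a unit selection database.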
JOURNAL OF MULTIMEDIA, VOL. 5, NO. 6, DECEMBER 2010 551
© 2010 ACADEMY PUBLISHER
doi:10.4304/jmm.5.6.551-559