2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December, 2017
Bangla Grapheme to Phoneme Conversion Using
Conditional Random Fields
Shammur Absar Chowdhury
University of Trento, Italy
shammur.chowdhury@unitn.it
Firoj Alam
QCRI, Qatar
fialam@hbku.edu.qa
Naira Khan
Dhaka University, Bangladesh
nairakhan@du.ac.bd
Sheak R. H. Noori
DIU, Bangladesh
drnoori@daffodilvarsity.edu.bd
Abstract —Integrated with handheld devices, toys, KIOSKs,
and call centers, Text to Speech (TTS) and Speech Recognition
(SR) have become widely used applications in everyday life. One
of the core components of said applications is Grapheme to
Phoneme (G2P) conversion. The task at hand is the mapping
of the written form to the spoken form, i.e. mapping one
sequence to another. In Natural Language Processing (NLP),
it is typically referred to as a sequence to sequence labeling
task. The task, however, is language dependent and has
primarily been implemented for English and similar resource-
rich languages. In comparison, very little has been done for
digitally under-resourced languages such as Bangla (ethnonym:
Bangla; exonym: Bengali). The current state-of-the-art Bangla
Grapheme to Phoneme conversion is limited to rule-based and
lexicon-based approaches, the development of which requires
significant contributions from linguistic experts. In this paper, we
propose a data-driven machine learning approach for Bangla
G2P conversion. We evaluate the existing rule-based approaches
and design a machine learning model using Conditional Random
Fields (CRFs). To train the machine learning models, we
have used only character-level contextual features, because
extracting hand-crafted features requires specialized
knowledge. We have evaluated the systems using two publicly
available datasets. We have obtained promising results, with
phoneme error rates of 1.51% and 14.88% for the CRBLP and Google
pronunciation lexicons, respectively.
Keywords—Bangla, Conditional Random Fields, Pronunciation
Generation, Grapheme to Phoneme (G2P)
I. Introduction
Although our daily interactions are dominated by speech as
the primary mode of communication, written communication also
occupies a significant space in the communication sphere of
human civilization. As such, written text must remain accessible
even to those who are visually impaired, for whom access to
synthesized speech of a written text is vital. For machine
understanding and generation, specifically for speech
synthesis, i.e., Text to Speech (TTS), and
Automatic Speech Recognition (ASR) systems, one important
step is to provide a mapping between orthographic and phonetic
representations. For said mapping task, we need to infer one
from the other, i.e., from orthographic to phonetic form and
vice versa. The notion of G2P is that it takes a word
(i.e., an orthographic representation), e.g., DUKE, and generates a
phonemic or phonetic representation, e.g., /d uw k/. An example
in Bangla is as follows: আদেশ /a d e sh/ (order). The G2P
system examines the grapheme sequence and utilizes different
rules/techniques to generate a phoneme sequence. In the relevant
literature, it is also referred to as a letter to sound mapping [1].
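To make the idea of examining the grapheme sequence concrete, the following is a minimal sketch of character-level context features of the kind a sequence labeler could consume. It is an illustration only, not the feature set used in this paper: the function name `char_context_features`, the window size of 2, and the padding symbols `<s>` and `</s>` are all assumptions. Note also that iterating over a Python string yields Unicode code points, so for Bangla a vowel sign such as ে would appear as its own unit unless the text is first segmented into grapheme clusters.

```python
def char_context_features(word, window=2):
    """Build one feature dict per character from its neighbours only.

    Hypothetical helper: the window size and padding symbols are
    illustrative assumptions, not values taken from the paper.
    """
    pad_left = ["<s>"] * window
    pad_right = ["</s>"] * window
    padded = pad_left + list(word) + pad_right
    features = []
    for i in range(window, window + len(word)):
        feats = {"c[0]": padded[i]}  # the current character
        for k in range(1, window + 1):
            feats[f"c[-{k}]"] = padded[i - k]  # k-th character to the left
            feats[f"c[+{k}]"] = padded[i + k]  # k-th character to the right
        features.append(feats)
    return features


feats = char_context_features("DUKE")
for ch, f in zip("DUKE", feats):
    print(ch, f)
```

Each dictionary describes one position in the word. A labeler trained on such features would additionally need a phoneme label aligned to every position, which this sketch does not attempt to produce.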
In the early days of computational G2P research, a typical
approach was to use a digitised pronunciation lexicon¹,
manually developed by lexicographers and linguists. For example,
a publicly available pronunciation lexicon for English
is the CMU Dictionary [2]², and for Bangla there are the CRBLP
Pronunciation Lexicon [3]³ and Google's Bangla pronunciation
lexicon [4]. The limitation of a lexicon-based approach is that
an automated system is not able to provide a pronunciation
for an unknown word. Another limitation is that it is memory
intensive to load a large list of lexical items, especially on
hand-held devices.
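The unknown-word limitation can be illustrated with a small sketch. The two-entry dictionary below is a toy stand-in built from the examples given earlier in this section, not an excerpt from any of the cited lexicons, and the names `TOY_LEXICON` and `lookup_pronunciation` are hypothetical.

```python
# Toy lexicon built from the two examples in the text (an assumption for
# illustration; real lexicons hold tens of thousands of entries).
TOY_LEXICON = {
    "DUKE": ["d", "uw", "k"],
    "আদেশ": ["a", "d", "e", "sh"],
}


def lookup_pronunciation(word, lexicon=TOY_LEXICON):
    """Return the stored phoneme list, or None for an unlisted word."""
    # upper() normalises Latin-script input; Bangla has no letter case,
    # so Bangla keys pass through unchanged.
    return lexicon.get(word.upper())


print(lookup_pronunciation("duke"))
print(lookup_pronunciation("dukes"))
```

The first call finds an entry, while the second returns `None` because the inflected form is not listed. A pure lookup has no way to generalise from the entries it already holds, and the whole table must stay in memory.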
Another early approach, based on implementing a deterministic
system, utilised pronunciation rules devised by linguists.
Some earlier work on rule-based approaches for English
can be found in [5], [6], [7], [8], [9]. For Bangla, the research
is sparse; one of the seminal studies can be found in [10],
later extended in the study of Alam et al. [3]. Other relevant
research includes [11], [12].
Data-driven statistical machine learning approaches are not
new; however, research efforts on such approaches are sparse.
The data-driven approach requires a lexicon containing an
exhaustive list of the pronunciations of the words in order to
train a machine learning model. For English, the earliest work was
done by Sejnowski et al. [13], [14] using a feed-forward neural
network comprising an input, a hidden, and an output layer.
Alternative machine learning approaches include the
use of decision trees [15]. A comparative study of several
algorithms has been done in [16]. We discuss the
different approaches in more detail in Section II.
Compared to the research on English, the only effort for
Bangla that we are aware of is that of [17], in which
the authors trained a machine learning model using 37K words. The
model was developed to assist a transcriber, and the reported
accuracy is 81.5%. In this study, we explore a CRF-based
machine learning approach for Bangla G2P conversion. Our
contributions include:
1) we provide a systematic comparison with existing rule-based
approaches, such as those in [10] and [3], using
publicly available pronunciation lexicons like CRBLP [3].
¹ A correspondence between the orthography of a word and its pronunciation.
² https://github.com/cmusphinx/cmudict
³ Available as part of a Bangla Text to Speech system:
https://github.com/firojalam/Katha-Bangla-TTS
978-1-5386-1150-0/17/$31.00 © 2017 IEEE