2017 20th International Conference of Computer and Information Technology (ICCIT), 22-24 December, 2017

Bangla Grapheme to Phoneme Conversion Using Conditional Random Fields

Shammur Absar Chowdhury, University of Trento, Italy, shammur.chowdhury@unitn.it
Firoj Alam, QCRI, Qatar, fialam@hbku.edu.qa
Naira Khan, Dhaka University, Bangladesh, nairakhan@du.ac.bd
Sheak R. H. Noori, DIU, Bangladesh, drnoori@daffodilvarsity.edu.bd

Abstract: Integrated with handheld devices, toys, kiosks, and call centers, Text to Speech (TTS) and Speech Recognition (SR) have become widely used applications in everyday life. One of the core components of these applications is Grapheme to Phoneme (G2P) conversion. The task is to map the written form to the spoken form, i.e., to map one sequence to another. In Natural Language Processing (NLP), it is typically referred to as a sequence-to-sequence labeling task. The task, however, is language dependent and has primarily been implemented for English and similar resource-rich languages. In comparison, very little has been done for digitally under-resourced languages such as Bangla (ethnonym: Bangla; exonym: Bengali). The current state of the art in Bangla G2P conversion is limited to rule-based and lexicon-based approaches, the development of which requires significant contributions from linguistic experts. In this paper, we propose a data-driven machine learning approach for Bangla G2P conversion. We evaluate the existing rule-based approaches and design a machine learning model using Conditional Random Fields (CRFs). To train the machine learning models, we have used only character-level contextual features, because extracting hand-crafted features requires specialized knowledge. We have evaluated the systems using two publicly available datasets. We have obtained promising results, with phoneme error rates of 1.51% and 14.88% on the CRBLP and Google pronunciation lexicons, respectively.
Keywords: Bangla, Conditional Random Fields, Pronunciation Generation, Grapheme to Phoneme (G2P)

I. Introduction

Although speech is the primary mode of communication in our daily interactions, written communication also occupies a significant space in the communication sphere of human civilization. As such, it is necessary to be able to access written text even if one is visually impaired. It is therefore vital for the visually impaired to have access to synthesized speech of a written text. For machine understanding and generation, specifically for speech synthesis, i.e., Text to Speech (TTS), and for Automatic Speech Recognition (ASR) systems, one important step is to provide a mapping between orthographic and phonetic representations. For this mapping task, we need to infer one from the other, i.e., the phonetic form from the orthographic form and vice versa.

The notion of G2P is that it takes a word (i.e., an orthographic representation), e.g., DUKE, and generates a phonemic or phonetic representation, e.g., /d uw k/. An example in Bangla is as follows: আদেশ /a d e sh/ (order). The G2P system examines the grapheme sequence and utilizes different rules or techniques to generate a phoneme sequence. In the relevant literature, it is also referred to as letter-to-sound mapping [1].

In the early days of computational G2P research, a typical approach was to use a digitised pronunciation lexicon¹, manually developed by lexicographers and linguists. For example, a publicly available pronunciation lexicon for English is the CMU Dictionary [2]², and for Bangla there are the CRBLP Pronunciation Lexicon [3]³ and Google's Bangla pronunciation lexicon [4]. One limitation of a lexicon-based approach is that an automated system is not able to provide a pronunciation for an unknown word. Another limitation is that loading a large list of lexical items is memory intensive, especially for hand-held devices.
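The unknown-word limitation of a lexicon-based approach can be illustrated with a minimal sketch. The toy lexicon below holds only the two example words given in the text; the names `TOY_LEXICON` and `lookup_pronunciation` are hypothetical and are not taken from any of the cited systems or lexicons.

```python
from typing import Dict, List, Optional

# Toy pronunciation lexicon (illustrative assumption): it contains only the
# two example words from the text, mapped to their phoneme sequences.
TOY_LEXICON: Dict[str, List[str]] = {
    "DUKE": ["d", "uw", "k"],
    "আদেশ": ["a", "d", "e", "sh"],
}


def lookup_pronunciation(
    word: str, lexicon: Dict[str, List[str]] = TOY_LEXICON
) -> Optional[List[str]]:
    """Return the phoneme sequence stored for `word`, or None if it is absent.

    A pure lexicon lookup cannot generalise: any word missing from the
    lexicon simply has no pronunciation.
    """
    return lexicon.get(word)


if __name__ == "__main__":
    for word in ["DUKE", "আদেশ", "DUKES"]:
        phonemes = lookup_pronunciation(word)
        if phonemes is None:
            print(f"{word}: not in lexicon, no pronunciation available")
        else:
            print(f"{word}: /{' '.join(phonemes)}/")
```

In this sketch, the lookup returns nothing for DUKES even though it differs from a known entry by a single letter, which is the gap that rule-based and data-driven G2P approaches aim to close.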
Another early approach, based on implementing a deterministic system, utilised pronunciation rules devised by linguists. Some earlier work on rule-based approaches for English can be found in [5], [6], [7], [8], [9]. For Bangla, the research is sparse; one of the seminal studies can be found in [10], later extended in the study of Alam et al. [3]. Other relevant research includes [11], [12].

Data-driven statistical machine learning approaches are not new; however, research efforts in this direction are sparse. The data-driven approach requires a lexicon containing an exhaustive list of word pronunciations in order to train a machine learning model. For English, the earliest work was done by Sejnowski et al. [13], [14] using a feed-forward neural network comprising an input layer, a hidden layer, and an output layer. An alternative machine learning approach uses decision trees [15]. A comparative study using several algorithms has been done in [16]. We discuss the different approaches in more detail in Section II.

Compared to the research on English, the only effort for Bangla that we are aware of is that of [17], in which a machine learning model was trained using 37K words. The model was developed to assist a transcriber, and the reported accuracy is 81.5%.

In this study, we explore a CRF-based machine learning approach for Bangla G2P conversion. Our contributions include:
1) we provide a systematic comparison with existing rule-based approaches, such as those in [10] and [3], using publicly available pronunciation lexicons like CRBLP [3].

¹ A correspondence between the orthography of a word and its pronunciation
² https://github.com/cmusphinx/cmudict
³ Available as part of a Bangla Text to Speech system: https://github.com/firojalam/Katha-Bangla-TTS

978-1-5386-1150-0/17/$31.00 © 2017 IEEE