A Chunk-based n-gram English to Thai Transliteration

Wirote Aroonmanakun, Non-member

ABSTRACT

In this study, a chunk-based n-gram model is proposed for English to Thai transliteration. The model is compared with three other models: a table lookup model, a decision tree model, and a statistical model. The chunk-based n-gram model achieves 67% word accuracy, which is higher than the accuracy of the other models. The performance of all models increases slightly when an English grapheme-to-phoneme module is included in the system. However, the accuracy of the system is not sufficient for a public transliteration tool. The low accuracy is caused by the poor performance of the English grapheme-to-phoneme module and by inconsistent pronunciation in the training data. Some suggestions are provided for further improvement.

Keywords: English-Thai Transliteration

1. INTRODUCTION

English to Thai transliteration is a way to write English words in the Thai alphabet. While English has 26 characters for 24 consonant and 20 vowel sounds,¹ Thai has 44 characters for 21 consonant sounds, 19 characters (including 3 consonant characters) for 24 vowel sounds (including 6 diphthongs), and 4 characters for tone markers. It is common for an English word to be transliterated into Thai in several different ways. For example, the word “internet” is written as อินเตอร์เน็ต, อินเตอร์เนต, อินเตอร์เนท, อินเตอร์เน็ท, อินเทอร์เน็ต, อินเทอร์เนต, or อินเทอร์เน็ท. To standardize transliteration, the Thai Royal Institute issued regulations for English-Thai transliteration in 1982. Nevertheless, many people tend to transliterate English words in their own way rather than adhering to the regulations. If English-Thai transliteration software that conforms to the Royal Institute’s guideline were available to the public, this variation should decrease.
In this study, we aim to develop such a system. A corpus of transliterated words was created by collecting English and Thai word pairs from books published by the Royal Institute. A total of 8,181 word pairs are used in this study. In each word pair, Thai characters are aligned with the corresponding English characters. Alignments between English and Thai characters are first assigned by a program and then manually corrected. More than one character in English or Thai may be aligned as a single unit, e.g. ‘th’-‘ท’, ‘ia’-‘เ.ีย’. Examples of aligned characters between word pairs are shown below. These data are used for training the transliteration systems.

ล/ ิ/ ท/ โ./ ซ/ อ/ ล/ ส/
l/ i/ th/ o/ s/ o/ l/ s/

ล/ ิ/ ท/ ัว/ น/ เ.ีย/
l/ i/ th/ ua/ n/ ia/

ล/ ิ/ ว/ เ.อ/ ร/ พ/ ู/ ล/
l/ i/ v/ e/ r/ p/ oo/ l/

ล/ ิ/ ฟว/ ิ/ ง/ ส/ ต/ โ./ น/ #/
l/ i/ v/ i/ ng/ s/ t/ o/ n/ e/

ล/ ี/ ว/ ี/ อั/ ส/
l/ i/ v/ i/ u/ s/

This paper first reviews previous models of transliteration systems. Table lookup, decision tree, and statistical models are briefly discussed. Then, a new chunk-based n-gram approach is described in section 3. The results for each model are reported and compared in section 4. Since knowing the English pronunciation is usually useful for transliteration, all the models are re-tested with an English grapheme-to-phoneme module added. The new results are reported in section 5. Though the chunk-based n-gram model performs better than the other models, its accuracy is not high enough for use as a public tool. Finally, we review and discuss the problems to guide further improvement.

Manuscript received on June 16, 2007; revised on August 20, 2006. The author is with Dept. of Linguistics, Faculty of Arts, Chulalongkorn University; E-mail: awirote@chula.ac.th

2. PREVIOUS RESEARCH

Since transliteration is basically a process of transforming one writing system into another, the approaches used in transliteration systems as well as those used in grapheme-to-phoneme systems are relevant. In this study, three different approaches, namely table lookup, decision trees, and statistical models, are reviewed and implemented.
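
To make the training data concrete, the following is a minimal sketch of how the slash-delimited aligned word pairs shown in the introduction could be read into chunk pairs, and how a simple context-free lookup table might be counted from them. It is an illustration only, not the implementation used in this study: the function names `parse_aligned` and `build_lookup_table` are hypothetical, the three sample words are copied from the alignment examples, and choosing the most frequent Thai chunk for each English chunk is an assumed simplification of the table lookup idea.

```python
from collections import Counter, defaultdict


def parse_aligned(english: str, thai: str) -> list[tuple[str, str]]:
    """Pair up the chunks of two slash-delimited aligned strings (hypothetical helper)."""
    def chunks(s: str) -> list[str]:
        # Drop the trailing slash, then trim whitespace around each chunk.
        return [c.strip() for c in s.strip().rstrip("/").split("/")]

    eng, tha = chunks(english), chunks(thai)
    if len(eng) != len(tha):
        raise ValueError(
            f"unaligned pair: {len(eng)} English vs {len(tha)} Thai chunks"
        )
    return list(zip(eng, tha))


def build_lookup_table(aligned_words):
    """Map each English chunk to the Thai chunk it is most often aligned with.

    Assumption: a context-free, most-frequent mapping; the models in the
    paper may use richer context than this.
    """
    counts = defaultdict(Counter)
    for pairs in aligned_words:
        for eng_chunk, thai_chunk in pairs:
            counts[eng_chunk][thai_chunk] += 1
    return {eng: c.most_common(1)[0][0] for eng, c in counts.items()}


# Three aligned word pairs taken from the examples in the introduction.
corpus = [
    parse_aligned("l/ i/ th/ o/ s/ o/ l/ s/", "ล/ ิ/ ท/ โ./ ซ/ อ/ ล/ ส/"),
    parse_aligned("l/ i/ th/ ua/ n/ ia/", "ล/ ิ/ ท/ ัว/ น/ เ.ีย/"),
    parse_aligned("l/ i/ v/ e/ r/ p/ oo/ l/", "ล/ ิ/ ว/ เ.อ/ ร/ พ/ ู/ ล/"),
]
table = build_lookup_table(corpus)
```

Even in this tiny sample, the English chunk ‘o’ is aligned with both ‘โ.’ and ‘อ’, so a table that keeps only one Thai chunk per English chunk cannot reproduce both spellings. This is the kind of ambiguity that motivates using context, as the models compared in this paper do.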