Morpheme-Based Grapheme to Phoneme Conversion Using Phonetic Patterns and Morphophonemic Connectivity Information

BYEONGCHANG KIM
Uiduk University
and
GARY GEUNBAE LEE and JONG-HYEOK LEE
Pohang University of Science and Technology

Both dictionary-based and rule-based methods for grapheme-to-phoneme conversion have their own advantages and limitations. For example, the dictionary-based method requires a large phonetic dictionary and complex morphophonemic rules, while the LTS (letter-to-sound) rule-based method cannot by itself model the complete set of morphophonemic constraints. This paper describes a grapheme-to-phoneme conversion method for Korean that combines the two approaches, using a phonetic pattern dictionary and CCV (consonant consonant vowel) LTS rules. The phonetic pattern dictionary, which represents the dictionary-based side, contains entries that pair a morpheme pattern with its phonetic pattern. The patterns represent candidate phonological changes at the left and right boundaries of morphemes. The CCV LTS rules, which represent the rule-based side, handle grapheme-to-phoneme conversion within morphemes. The conversion method consists of two main steps, morpheme-to-phoneme conversion and a morphophonemic connectivity check, preceded by two preprocessing steps, phrase break prediction and morpheme normalization. Phrase break prediction predicts phrase breaks with a stochastic method based on part-of-speech (POS) information. Morpheme normalization replaces non-Korean symbols with their corresponding standard Korean graphemes. In the morpheme-phoneticizing module, each morpheme in the phrase is converted into phonetic patterns by looking it up in the phonetic pattern dictionary. Graphemes within a morpheme are grouped into CCV units and converted into phonemes by the CCV LTS rules.
The morphophonemic connectivity table supports grammaticality checking of two adjacent phonetic morphemes. In experiments with a corpus of 4,973 sentences free of non-Korean symbols, 99.98% of the graphemes and 99.0% of the sentences were converted correctly. With a broadcast news corpus of 621 sentences, 99.7% of the graphemes and 86.6% of the sentences were converted correctly. A full Korean TTS (text-to-speech) system is now being implemented using this conversion method.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing – speech recognition and synthesis; G.3 [Probability and Statistics]: Statistical computing; H.5.2 [User Interfaces]: Natural language

General Terms: Languages, Experimentation, Performance

Additional Key Words and Phrases: Text-to-speech system, grapheme-to-phoneme conversion, morphophonemic modeling, phonetic pattern dictionary, CCV LTS rule

This research was supported partially by the University Research Program of the Ministry of Information & Communication of Korea, and partially by the Ministry of Education of Korea through its financial support of the Electrical and Computer Engineering Division at POSTECH under its BK21 program.

Authors' addresses: Department of Computer Engineering, Uiduk University, Kangdong, Kyongju, 780-713, South Korea; email: bckim@uiduk.ac.kr

Permission to make digital/hard copy of part of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

© 2002 ACM 1530-0226/02/0300-0065 $5.00

ACM Transactions on Asian Language Information Processing, Vol. 1, No. 1, March 2002, Pages 65-82.
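
The grouping of graphemes into CCV units mentioned in the abstract can be illustrated with a minimal Python sketch. The decomposition of a precomposed Hangul syllable into its jamo uses the standard Unicode arithmetic. The window chosen for a CCV unit (the final consonant of the preceding syllable, then the initial consonant and vowel of the current syllable) is an assumption made for illustration only, since the abstract does not define the unit. The function names `decompose` and `ccv_units` are hypothetical, and no LTS rules are applied here.

```python
# Standard Unicode layout of precomposed Hangul syllables (U+AC00..U+D7A3):
# code = 0xAC00 + (lead * 21 + vowel) * 28 + tail
S_BASE = 0xAC00
N_LEADS, N_VOWELS, N_TAILS = 19, 21, 28  # tail index 0 means "no final consonant"

LEADS = [chr(0x1100 + i) for i in range(N_LEADS)]        # initial consonants
VOWELS = [chr(0x1161 + i) for i in range(N_VOWELS)]      # medial vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(N_TAILS - 1)]  # final consonants


def decompose(syllable):
    """Split one precomposed Hangul syllable into (lead, vowel, tail) jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < N_LEADS * N_VOWELS * N_TAILS:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    lead, rest = divmod(index, N_VOWELS * N_TAILS)
    vowel, tail = divmod(rest, N_TAILS)
    return LEADS[lead], VOWELS[vowel], TAILS[tail]


def ccv_units(morpheme):
    """Group the graphemes of a morpheme into CCV triples.

    ASSUMPTION (not stated in the abstract): each unit is
    (final consonant of the previous syllable, initial consonant, vowel).
    Returns the list of units and the final consonant left over after the
    last syllable, which would meet the next morpheme at the boundary.
    """
    units = []
    previous_tail = ""
    for syllable in morpheme:
        lead, vowel, tail = decompose(syllable)
        units.append((previous_tail, lead, vowel))
        previous_tail = tail
    return units, previous_tail


if __name__ == "__main__":
    # U+AD6D U+BB3C is a two-syllable word whose first syllable ends in a
    # consonant, so the second unit carries a non-empty first slot.
    units, trailing = ccv_units("\uad6d\ubb3c")
    for unit in units:
        print(unit)
    print("trailing final consonant:", repr(trailing))
```

Under this assumed window, a consonant cluster that straddles a syllable boundary falls inside a single unit, which is where a rule keyed on two consonants and a vowel would have the context it needs. The consonant left over at the end of the morpheme is returned separately because, according to the abstract, changes at morpheme boundaries are handled by the phonetic pattern dictionary and not by the CCV LTS rules.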