Prahallad et al. / J Zhejiang Univ SCI 2005 6A(11):1354-1361

A simple approach for building transliteration editors for Indian languages

PRAHALLAD Lavanya, PRAHALLAD Kishore, GANAPATHIRAJU Madhavi
(Institute for Software International, Carnegie Mellon University, Pittsburgh, PA 15217, USA)
E-mail: lavanyap@cmu.edu; skishore@cs.cmu.edu; madhavi@cs.cmu.edu
Received Aug. 5, 2005; revision accepted Sept. 10, 2005

Abstract: Transliteration editors are essential for keying Indian language scripts into a computer using a QWERTY keyboard. Applications of transliteration editors in the context of the Universal Digital Library (UDL) include entry of meta-data and dictionaries for Indian languages. In this paper we propose a simple approach for building transliteration editors for Indian languages using Unicode and by taking advantage of its rendering engine. We demonstrate the usefulness of this Unicode-based approach and report its advantages: it needs little maintenance and few entries in the mapping table, and new features, such as additional letters, are easy to add to the transliteration scheme. We demonstrate the transliteration editor for 9 Indian languages and also explain how this approach can be adapted for Arabic scripts.

Key words: Transliteration editor, Indian languages, Universal Digital Library (UDL)
doi:10.1631/jzus.2005.A1354 Document code: A CLC number: TP391

INTRODUCTION

Transliteration editors are essential for keying Indian language scripts into a computer using a QWERTY keyboard. Applications of transliteration editors in the context of the Universal Digital Library (UDL) include entry of meta-data and dictionaries for Indian languages. In this paper we propose a simple approach for building transliteration editors for Indian languages using Unicode and by taking advantage of its rendering engine, which is available in the Windows XP and Linux operating systems.
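To make the idea of relying on the Unicode rendering engine concrete, the following minimal Python sketch composes Devanagari syllables according to the rule described in the rest of this Introduction: a Halant between consecutive consonants, no vowel sign for the schwa, and the Maatra for any other vowel. It is only an illustration under stated assumptions, not the editor described in this paper. The Roman keys in `CONSONANTS` and `VOWELS` are hypothetical stand-ins and not the actual IT3 mapping table, the function names `compose_syllable` and `transliterate` are invented here, and only a handful of Devanagari letters are covered. The paper speaks of concatenating UTF-8 sequences; the sketch concatenates Unicode code points in Python strings, which become UTF-8 bytes once encoded.

```python
# Illustrative sketch only: a tiny, hypothetical mapping table for Devanagari.
# The Roman keys below are NOT the real IT3 scheme.
HALANT = "\u094D"  # Devanagari sign virama (Halant/Viraam)

CONSONANTS = {
    "k": "\u0915",  # KA
    "t": "\u0924",  # TA
    "r": "\u0930",  # RA
    "s": "\u0938",  # SA
    "m": "\u092E",  # MA
}

# vowel key -> (independent vowel letter, Maatra); the schwa has no Maatra
VOWELS = {
    "a": ("\u0905", ""),
    "aa": ("\u0906", "\u093E"),
    "i": ("\u0907", "\u093F"),
    "ii": ("\u0908", "\u0940"),
    "u": ("\u0909", "\u0941"),
}


def compose_syllable(consonants, vowel):
    """Build the code-point string for one V, CV, CCV or CCCV unit."""
    independent, maatra = VOWELS[vowel]
    if not consonants:
        # A bare V unit uses the independent vowel letter.
        return independent
    # C$C$...C: a Halant between every two consonants of the cluster.
    cluster = HALANT.join(CONSONANTS[c] for c in consonants)
    # The schwa is inherent in the last consonant, so its Maatra is empty.
    return cluster + maatra


def transliterate(tokens):
    """Group Roman tokens into syllables, using vowels as anchor points."""
    out, pending = [], []
    for tok in tokens:
        if tok in VOWELS:
            out.append(compose_syllable(pending, tok))
            pending = []
        elif tok in CONSONANTS:
            pending.append(tok)
        else:
            raise KeyError(f"no mapping-table entry for {tok!r}")
    if pending:
        # Assumption of this sketch: every syllable must end with a vowel.
        raise ValueError("trailing consonants without a vowel")
    return "".join(out)


if __name__ == "__main__":
    word = transliterate(["m", "aa", "t", "r", "aa"])
    print(word)                  # rendered by the platform's Unicode engine
    print(word.encode("utf-8"))  # the UTF-8 byte sequence that is stored
```

In this sketch the mapping table needs one entry per consonant and one per vowel rather than one per syllable, because the shape of each CV, CCV or CCCV unit is left to the rendering engine. For example, `transliterate(["m", "aa", "t", "r", "aa"])` yields the code points U+092E U+093E U+0924 U+094D U+0930 U+093E.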
We use the transliteration scheme referred to as IT3, developed by IISc Bangalore and Carnegie Mellon University, to represent the Indian language scripts. Indian language scripts are syllabic in nature and consist of V, CV, CCV and CCCV units, where C is a consonant and V is a vowel. Because a syllable in these scripts always ends with a vowel, syllables are easy to identify using the vowels as anchor points.

To render the syllables on the computer screen, we use Unicode and the Unicode rendering engine available in the Windows XP and Linux operating systems. To display a CV unit, we concatenate the UTF-8 sequences of C and V, which causes the rendering engine to render the appropriate shape for CV. To display a CCV unit, the consonant cluster must be rendered, so a special character called Halant/Viraam ($) is inserted between every two consonants: to render CCV we concatenate the UTF-8 sequences of C$CV, and to render a CCCV unit we concatenate those of C$C$CV.

In these syllables, if the vowel is a schwa (the short vowel /a/), it is omitted, because the last consonant in the syllable carries it by default: every consonant in the Indian language scripts has an inherent schwa, and so does its Unicode representation. If the vowel is not a schwa, the UTF-8 sequence of the Maatra of the corresponding vowel is used instead. A Maatra is the modified shape a vowel takes when it is combined with a consonant; each vowel has only one Maatra.

Journal of Zhejiang University SCIENCE ISSN 1009-3095 http://www.zju.edu.cn/jzus E-mail: jzus@zju.edu.cn

In this paper, we demonstrate the usefulness of such a simple scheme to build transliteration editors for Indian languages, and report its advantages of needing few entries in the mapping table, little