Character Sequence Modeling for Transliteration

Manoj Kumar Chinnakotla and Om P. Damani
Department of Computer Science and Engineering, Indian Institute of Technology, Bombay
Mumbai, India
{manoj,damani}@cse.iitb.ac.in

Abstract

Character Sequence Modeling (CSM), typically called Language Modeling, has not received sufficient attention in current transliteration research. We discuss the impact of various CSM factors, such as word granularity, smoothing technique, corpus variation, and word origin, on transliteration accuracy. We demonstrate the importance of CSM by showing that, for transliterating into English from two very different languages, Hindi and Persian, systems employing only monolingual resources and simple non-probabilistic character mappings achieve accuracy close to that of baseline statistical systems employing parallel transliteration pairs. This shows that a reasonable transliteration system can be built for resource-scarce languages that lack large parallel corpora.

1 Introduction

Transliteration is the process of mapping a written word from one language-script pair to another language-script pair. For example, the Hindi words and correspond to the English words param and share, respectively. Transliterating a word from the language of its origin to a foreign language is called Forward Transliteration, while transliterating a loan-word written in a foreign language back to the language of its origin is called Backward Transliteration. Our focus is on general-purpose transliteration into English from resource-scarce languages for which very large parallel corpora of transliteration pairs do not exist. By general purpose, we mean that the system should do both forward and backward transliteration.
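As an illustration of what a character sequence model computes, the following is a minimal sketch of a character bigram model with add-one smoothing that scores English candidate strings. It is not the authors' implementation: the bigram order, the add-one smoothing, the `#` boundary symbol, the length normalization, and all function names are assumptions chosen for brevity.

```python
import math
from collections import Counter

BOUNDARY = "#"  # hypothetical word-boundary symbol (an assumption)


def train_char_bigrams(words):
    """Count character bigrams over boundary-padded, lowercased words."""
    context_counts = Counter()  # how often each character occurs as a context
    bigram_counts = Counter()   # how often each (previous, current) pair occurs
    alphabet = set()
    for word in words:
        padded = BOUNDARY + word.lower() + BOUNDARY
        alphabet.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, len(alphabet)


def log_prob(word, model):
    """Log-probability of a word under the add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    padded = BOUNDARY + word.lower() + BOUNDARY
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


def rank_candidates(candidates, model):
    """Sort candidates from most to least likely, normalized by length."""
    return sorted(
        candidates,
        key=lambda w: log_prob(w, model) / (len(w) + 1),
        reverse=True,
    )


# Toy monolingual word list (an assumption, not the paper's corpus).
toy_corpus = ["param", "share", "parade", "shame", "paper", "sharp"]
toy_model = train_char_bigrams(toy_corpus)
ranked = rank_candidates(["xqzpm", "param"], toy_model)
```

In this sketch, a candidate whose character sequences resemble the training words receives a higher score than one with unseen sequences. The choices of corpus, smoothing, and what counts as a word are exactly the kind of factors the paper argues deserve attention.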
Given the wide variety of script-language pairs in the world, many different methods have been proposed for transliteration: grapheme based (Ganesh et al., 2008; AbdulJaleel and Larkey, 2003; Sherif and Kondrak, 2007; Haizhou et al., 2004; Kumaran and Kellner, 2007; Ekbal et al., 2006), phoneme based (Surana and Singh, 2008; Virga and Khudanpur, 2003; Knight and Graehl, 1997), and hybrid (Oh and Choi, 2002). Some of these systems are statistical in nature, while others are rule-based. One thing common to these systems is the direct or indirect use of Character Sequence Modeling (CSM), typically called Language Modeling.

In the transliteration literature (Surana and Singh, 2008; Kumaran and Kellner, 2007; Sherif and Kondrak, 2007), many papers mention N for the N-gram model but omit other important details, such as the corpus over which the model was computed, the back-off scheme employed, and what constitutes a word in such a model. While these factors are important for any application of language models, we argue that Character Sequence Modeling is different from traditional language modeling and that its exploration can pay rich dividends.

We present results for grapheme-to-grapheme, general-purpose Hindi-to-English and Persian-to-English transliteration systems. Our systems use CSM on the source side for word origin identification, a simple non-probabilistic character mapping to generate transliteration candidates, and then CSM on the target side to rank the candidates. We demonstrate the importance of CSM by noting that our systems employing only monolingual resources and simple non-

Proceedings of ICON 2009: 7th International Conference on Natural Language Processing, Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009