Revisiting Automatic Transliteration Problem for Code-Mixed Romanized Indian Social Media Text

Kunal Chakma
Computer Science & Engineering Department
National Institute of Technology Agartala
Jirania, Tripura, India
kchax4377@gmail.com

Amitava Das
Human Language Technologies (HiLT) lab
University of North Texas, USA
amitava.santu@gmail.com

Abstract

Although automatic transliteration for Indian languages is well studied, available transliteration techniques fail in the Indian social media context due to phenomena such as wordplay, creative spelling, code-mixing, and phonetic romanized typing, all of which imply that transliteration for Indian social media text has to be revisited. The paper reports an initial study on automatic transliteration for a Facebook message corpus in mixed English-Bengali-Hindi, aimed at restoring Hindi and Bengali code-mixed words into Devanagari and Bengali script, respectively.

Keywords: transliteration, code-mixing, social media text

1. Introduction

The study of code-mixing in social media text (SMT) is a fairly new research strand. SMT is characterized by a high percentage of spelling errors and by creative spellings (gr8 for 'great'), phonetic typing, word play (goooood for 'good'), and abbreviations (OMG for 'Oh my God!'). Non-English speakers do not always use Unicode to write social media text in their own language, frequently insert English elements (through code-mixing and Anglicisms), and often mix multiple languages to express their thoughts. This makes automatic language detection in social media texts a very challenging task, one that has only recently started to attract attention.

Different types of language mixing phenomena have, however, been discussed and defined by several linguists. Some make clear distinctions between phenomena based on certain criteria, while others use 'code-mixing' or 'code-switching' as umbrella terms for any type of language mixing (see, e.g., Muysken (2000) or Gafaranga and Torras (2002)), as it is not always clear where borrowings/Anglicisms stop and code-mixing begins (Alex, 2008).

An essential prerequisite for any kind of automatic text processing is to be able to identify the language in which a specific segment is written. Of particular relevance here is word-level language identification in social media texts. Available language detectors fail on these texts because of the style of writing and the brevity of the texts, despite a common belief that language identification is an almost solved problem (McNamee, 2005). Word-level language detection is, however, a separate problem altogether; this paper concentrates only on transliteration.

Automatic transliteration of code-mixed romanized Indian SMT is particularly problematic because there is no standard for romanization. People are quite creative in their spellings, and a single word often has several alternative phonetic spellings. For example:

आँखों (eyes): aankhon / aankho / ankho / ankhon
ये (this): iye / yeh / ye / y
অনেক (multiple): anek / onek / onk / oneeek
অপেক্ষা (waiting): opekkha / opekha / oppekha

The reverse also holds. In several cases one romanized word can be transliterated into multiple possible outputs, depending on context:

kam: काम (work) / कम (less)
aste: আসতে (to come) / আস্তে (slowly)
beche: বেছে (chosen) / বেঁচে (alive)

Moreover, people very often mix numerals into their romanized phonetic representations. Those cases are even more challenging.
अच्छा (okay): a66a
অগোছালো (mess): ogo6alo
একটু (some): ek2
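To make the numeral and elongation phenomena concrete, the following is a minimal Python sketch of a normalization pass that could run before transliteration. It is not the method reported in this paper. The function names and the digit-to-sound map (`6` read as "chh", `2` read as "tu") are assumptions inferred only from the examples above, and a real system would need a larger, language-aware table.

```python
import re

# Hypothetical digit-to-sound map, inferred only from the examples in the text
# (a66a, ogo6alo, ek2). It is an assumption, not a resource from the paper.
DIGIT_SOUNDS = {"6": "chh", "2": "tu"}


def expand_digits(token: str) -> str:
    """Replace digits embedded in a romanized token with assumed phonetic strings.

    Tokens made up only of digits are left untouched, since they are most
    likely real numbers rather than phonetic shorthand.
    """
    if token.isdigit():
        return token
    return "".join(DIGIT_SOUNDS.get(ch, ch) for ch in token)


def collapse_elongation(token: str, max_run: int = 2) -> str:
    """Shorten any run of more than max_run identical characters to max_run."""
    pattern = r"(.)\1{%d,}" % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, token)


if __name__ == "__main__":
    for word in ["a66a", "ogo6alo", "ek2", "goooood", "oneeek"]:
        print(word, "->", collapse_elongation(expand_digits(word)))
```

The output of such a pass is still an ambiguous romanized string (for instance, it does nothing to decide between the two readings of aste), so it would only serve as a preprocessing step ahead of a context-sensitive transliteration model.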