Hidden Markov Model Based Identification of Transliterated Regional Language Words in Text Documents

Achuth Sankar S. Nair 1, Vrinda V. Nair 2, Vinod Chandra S. S. 3

1 Centre for Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
2 Department of Electronics & Communications, College of Engineering Thrissur, Kerala, India
3 Department of Computer Science, College of Engineering Thiruvananthapuram, Kerala, India

{sankar.achuth, vinodchandrass}@gmail.com, nairvrinda@rediffmail.com

Abstract

Text in roman script is popular today on the internet in non-English-speaking countries, as the script is almost universally supported by text processors. In a linguistic cauldron like India, it is very common to see English text in email messages and chat transcripts with a generous sprinkling of local language words in roman script. Dubbed Hinglish (Hindi and English), Manglish (Malayalam and English) etc., these roman transliterations of non-English languages contribute a major source of noise when analyzing English text in these countries and groups. The present work reports the initial results of developing an HMM-based classifier for such linguistic noise in English text. The system reported is trained to classify English and non-English words; however, it can be expanded to a distinct collection of models for various languages and provide a 'linguistic coloring' feature in text editors.

1. Introduction

Noisy text (we confine ourselves to English) traditionally arises from spelling errors, abbreviations, non-standard words, false starts, pause-filling words etc. However, in the context of non-English-speaking countries like India, a major source of noise is the use of transliterated regional language words. This phenomenon is seen widely in internet chat rooms and email messages.
In the phonetic Indian languages, the typing overhead is greater than in English, and hence there is a natural inclination to use roman script, producing the so-called Hinglish (Hindi and English), Manglish (Malayalam and English), Benglish (Bengali and English) etc. This is especially true of young non-resident Indians, who may be familiar with the spoken form of their mother tongue but not with its script, which leads to text such as the following: "Hi Dadiji, How are you? How is chottu? I am having my vacation here and I will soon be sending some special Kapada and Makan for you…"

When this kind of text is to be machine processed, word models of various languages are required to sieve out the underlined words. We treat them together with other non-English noise (mistakes, fillers etc.), but there is scope to classify the non-English words further using distinct models for each regional language concerned. We report in this paper a preliminary investigation using stochastic models.

The Hidden Markov Model (HMM) is the stochastic tool used. In our approach, we use a first-order Hidden Markov Model [Jelinek, 1976; Rabiner, 1989]. An HMM has a set of states S = {S1, S2, ..., Sn} which emit symbols with different probabilities, and there is a transition probability between each pair of succeeding states. In our case each state emits a set of letters, and the model as a whole generates a word as it occurs in a document. Figure 1 shows an overview of the HMM used. The first-order assumption is a simplification and is neither complete nor final.

This paper is organized as follows. Section 2 discusses some related works. Section 3 describes the architecture of the HMM for identifying lingual noise arising from local language words in roman script in a document. Section 4 presents results and conclusions.

2. Related Works

There are models which can be used to capture word dependencies and to perform inference on sequences, be they whole documents or short passages [Denoyer et al., 2001].
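The idea of scoring a word with a first-order letter HMM can be sketched as below. This is a minimal illustration, not the paper's trained model: the two-state topology and all probabilities here are hypothetical placeholders, and in practice one such model would be trained per language so that a word is assigned to whichever language model gives the higher likelihood.

```python
import math

def forward_log_likelihood(word, states, start_p, trans_p, emit_p):
    """Forward algorithm: log-likelihood of a word under a letter HMM.

    alpha[s] holds the probability of the prefix seen so far,
    ending in state s.  (No scaling; adequate for short words.)
    """
    floor = 1e-8  # tiny probability for letters a state was never seen to emit
    alpha = {s: start_p[s] * emit_p[s].get(word[0], floor) for s in states}
    for ch in word[1:]:
        alpha = {
            s: sum(alpha[r] * trans_p[r][s] for r in states)
               * emit_p[s].get(ch, floor)
            for s in states
        }
    return math.log(sum(alpha.values()))

def normalize(d):
    """Scale a dict of weights so they form a probability distribution."""
    z = sum(d.values())
    return {k: v / z for k, v in d.items()}

# Toy two-state model (hypothetical): one state favours vowels,
# the other consonants -- stand-ins for states learned from data.
states = ("V", "C")
start_p = {"V": 0.4, "C": 0.6}
trans_p = {"V": {"V": 0.3, "C": 0.7}, "C": {"V": 0.6, "C": 0.4}}
vowels = set("aeiou")
letters = "abcdefghijklmnopqrstuvwxyz"
emit_p = {
    "V": normalize({c: (9.0 if c in vowels else 0.25) for c in letters}),
    "C": normalize({c: (0.1 if c in vowels else 2.0) for c in letters}),
}

score = forward_log_likelihood("makan", states, start_p, trans_p, emit_p)
```

With one such model per language, classification reduces to comparing the log-likelihoods each model assigns to the same word.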
They allow extending the classical paradigms of information retrieval by considering sequences of text elements instead of the classical bag-of-words representation. Another classification method has been applied to biological data [Chen et al., 2006]. It uses a support-vector-machine-trained classifier followed by a novel phrase-based clustering algorithm. This clustering step

AND 2007