IEEE-International Conference On Advances In Engineering, Science And Management (ICAESM -2012) March 30, 31, 2012
ISBN: 978-81-909042-2-3 ©2012 IEEE

Probabilistic Language Model for Template Messaging based on Bi-Gram

Rina Damdoo 1, Urmila Shrawankar 2
Research Student 1, IEEE Student Member 2
Department of Computer Science and Engineering
G. H. Raisoni College of Engineering
Nagpur, MS, INDIA
rinadamdoo@rediffmail.com 1, urmila@ieee.org 2

Abstract—This paper reports the benefits of probabilistic language modeling in the template messaging domain. Through Statistical Machine Translation (SMT), sentences written with short forms, misspelled words and chatting slang can be corrected. Given a source-language sentence (e.g., a short message), the problem of machine translation is to automatically produce a target-language translation (e.g., long-form English); the system is intended for use by the young generation for messaging. The main goal behind this project is to analyze the improvement in efficiency as the size of the bilingual corpus increases. The main applications of the project are machine learning and translation systems, dictionary and textbook preparation, patent and reference searches, and various information retrieval systems.

Keywords- Language Model, Machine Translation, N-gram, Probability distribution table (PDT), Statistical Machine Translation (SMT), Text Normalization

I. INTRODUCTION

Internet users have popularized Internet slang (Internet short-hand, netspeak or chatspeak), a type of slang that has proved beneficial in many cases. Such terms often originate with the purpose of saving keystrokes. Many people use the same abbreviations in texting, in instant messaging and on social networking websites (e.g., "u" means "you"). Acronyms, keyboard symbols and shortened words are often used as methods of abbreviation in Internet slang. New dialects of slang, such as leet or Lolspeak, develop as in-group memes rather than time savers.
Secondly, over the past few years, social networks, chat rooms and forums have become the most important websites for users to share information about their life, work and interests. This new way of communication has evolved so that these sites all share a casual common language. Users write on these sites as if they were writing SMS messages on their mobile phones, without paying attention to correct spelling and, moreover, using user-created abbreviations for common phrases; e.g., “how are you?” is commonly written as “h r u?”. For these reasons, existing natural language processing tools cannot process the content generated on these websites. Simple tools like dictionaries are not entirely satisfying, because the same abbreviation may have several expansions with different meanings (“2” could mean “too”, “to” or “two”), so a context analysis is needed to choose the right expansion. A machine translation system may address this challenge because it combines a translation model, which offers the different meanings for the same abbreviation or misspelled word, with a context analysis, which uses the current context to choose the best translation.

Figure 1. System model for the project
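
The interplay between a translation table and context analysis described in the Introduction can be illustrated with a minimal sketch. It is not the system shown in Figure 1: the toy corpus, the `CANDIDATES` table, the function names, the add-one smoothing and the greedy left-to-right choice are all assumptions made for illustration. Each short form is mapped to its candidate expansions, and a bi-gram model trained on long-form text picks the candidate that is most probable given the previous word.

```python
from collections import Counter

# Toy long-form corpus (hypothetical; a real system would train on a
# much larger corpus).
CORPUS = [
    "how are you",
    "i want to see you",
    "see you at two",
    "me too",
    "i want to go",
]

# Hypothetical translation table: short form -> candidate long forms.
CANDIDATES = {
    "h": ["how"],
    "r": ["are"],
    "u": ["you"],
    "2": ["to", "too", "two"],
}


def train_bigrams(sentences):
    """Count bi-grams and their left contexts; return the counts and vocabulary size."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        vocab.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocab)


def bigram_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """P(word | prev) with add-one smoothing (an assumption of this sketch)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


def normalize(message, model):
    """Greedily expand each token, choosing the candidate most likely after the previous word."""
    context_counts, bigram_counts, vocab_size = model
    prev, output = "<s>", []
    for token in message.lower().split():
        options = CANDIDATES.get(token, [token])  # unknown tokens pass through unchanged
        best = max(
            options,
            key=lambda w: bigram_prob(prev, w, context_counts, bigram_counts, vocab_size),
        )
        output.append(best)
        prev = best
    return " ".join(output)


MODEL = train_bigrams(CORPUS)
print(normalize("h r u", MODEL))        # how are you
print(normalize("i want 2 go", MODEL))  # i want to go
print(normalize("me 2", MODEL))         # me too
```

In this sketch the same token "2" expands to "to" after "want" and to "too" after "me", which a plain dictionary lookup could not do. Punctuation handling and a search over whole-sentence probabilities are left out to keep the example short.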