IEEE-International Conference On Advances In Engineering, Science And Management (ICAESM -2012) March 30, 31, 2012 196
ISBN: 978-81-909042-2-3 ©2012 IEEE
Probabilistic Language Model for Template
Messaging based on Bi-Gram
Rina Damdoo¹, Urmila Shrawankar²
Research Student¹, IEEE Student Member²
Department of Computer Science and Engineering
G. H. Raisoni College of Engineering
Nagpur, MS, INDIA
rinadamdoo@rediffmail.com¹, urmila@ieee.org²
Abstract- This paper reports the benefits of probabilistic
language modeling in the template messaging domain. Through
Statistical Machine Translation (SMT), sentences written with
short forms, misspelled words and chatting slang can be
corrected. Given a source-language sentence (e.g., a short
message), the problem of machine translation is to automatically
produce a target-language translation (e.g., long-form English),
for use by the young generation in messaging. The main goal
of this project is to analyze the improvement in efficiency as
the size of the bilingual corpus increases. The main applications
of the project are machine learning and translation systems,
dictionary and textbook preparation, patent and reference
searches, and various information retrieval systems.
Keywords- Language Model, Machine Translation, N-gram,
Probability Distribution Table (PDT), Statistical Machine
Translation (SMT), Text Normalization
I. INTRODUCTION
Internet users have popularized Internet slang (Internet
shorthand, netspeak or chatspeak), a type of slang that has
proved beneficial in many cases. Such terms often originate with
the purpose of saving keystrokes. Many people use the same
abbreviations in texting, instant messaging (e.g., “u” for “you”)
and social networking websites. Acronyms, keyboard symbols
and shortened words are often used as methods of
abbreviation in Internet slang. New dialects of slang, such as
leet or Lolspeak, develop as in-group memes rather than time
savers.
Secondly, over the past few years social networks, chat rooms
and forums have become the most important websites for
users to share information about their life, work and interests.
This new way of communication has evolved in such a way
that these sites all share a casual common language. Users write
on these sites as if they were writing SMS messages on their
mobile phones, without paying attention to correct spelling
and, moreover, using user-created abbreviations for common
phrases, e.g. “how are you?” is commonly written as “h r u?”.
For these reasons, existing natural language processing tools
cannot reliably process the user-generated content found on
these websites.
Simple tools like dictionaries are not entirely satisfactory,
because the same abbreviation may have several expansions
with different meanings (“2” could mean “too”, “to” or
“two”), so the context must be analyzed to choose the right
expansion. A machine translation system may address this
challenge because it combines a translation model, which
offers the different meanings of the same abbreviation or
misspelled word, with a context analysis, which uses the
current context to choose the best translation.
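The interplay between the translation model and the context analysis described above can be illustrated with a minimal sketch. It is not the implementation used in this project: the candidate table, the toy corpus, the add-one smoothing and the greedy left-to-right decoding are all assumptions chosen for illustration. The candidate table plays the role of the translation model (it lists the possible expansions of each short form), and a bi-gram count table plays the role of the context analysis (it picks the expansion that best follows the previous word).

```python
from collections import Counter

# Hypothetical candidate table (stand-in for a translation model):
# each short form maps to its possible long-form expansions.
CANDIDATES = {
    "2": ["to", "too", "two"],
    "u": ["you"],
    "r": ["are"],
    "h": ["how"],
}

# Hypothetical toy corpus of long-form English sentences.
CORPUS = [
    "how are you",
    "nice to meet you",
    "me too",
    "i have two tickets",
    "i want to go home",
]


def train_bigrams(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing (an assumption)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def normalize(message, unigrams, bigrams):
    """Greedily expand each token using the previous output word as context."""
    prev = "<s>"
    output = []
    for token in message.lower().split():
        # Tokens without an entry in the table are kept unchanged.
        options = CANDIDATES.get(token, [token])
        best = max(options, key=lambda w: bigram_prob(prev, w, unigrams, bigrams))
        output.append(best)
        prev = best
    return " ".join(output)


UNIGRAMS, BIGRAMS = train_bigrams(CORPUS)

print(normalize("h r u", UNIGRAMS, BIGRAMS))             # how are you
print(normalize("me 2", UNIGRAMS, BIGRAMS))              # me too
print(normalize("i have 2 tickets", UNIGRAMS, BIGRAMS))  # i have two tickets
```

In this sketch the same short form “2” is expanded to “too”, “two” or “to” depending only on the preceding word, which a plain dictionary lookup cannot do. A greedy decoder that looks only to the left is the simplest choice; it would fail where the following word is needed to disambiguate, and a fuller system would score whole candidate sentences instead.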
Figure 1. System model for the project