TOWARD TEXT MESSAGE NORMALIZATION: MODELING ABBREVIATION
GENERATION
Deana Pennell and Yang Liu
Computer Science Department
The University of Texas at Dallas
{deana, yangl}@hlt.utdallas.edu
ABSTRACT
This paper describes a text normalization system for deletion-based
abbreviations in informal text. We propose using statistical classi-
fiers to learn the probability of deleting a given character using fea-
tures based on character context, position in the word and containing
syllable, and function within the word. To ensure that our system is
robust to different and previously unseen abbreviations for a word,
we generate multiple abbreviation hypotheses for a word using the
predictions from the classifiers. We then reverse the mappings to
enable recovery of English words from the abbreviations. Differ-
ent knowledge sources are used to disambiguate word candidates:
abbreviation likelihood, length, and language model scores. Our
results show that this approach is feasible and warrants further
exploration.
Index Terms— noisy text processing, twitter, text normaliza-
tion, abbreviation modeling
1. INTRODUCTION
Text messaging (SMS) is a rapidly growing form of alternative
communication for cell phones. This popularity has raised safety
concerns, leading many US states to pass laws prohibiting texting
while driving. The technology is also difficult to use for people
with visual impairments or physical disabilities. We believe a
text-to-speech (TTS) system for cell phones can mitigate these
problems, promoting safe travel and ease of use for all users.
Text message lingo is also similar to the chatspeak that is
prevalent on forums, blogs and chatrooms. Screen readers would also
benefit from normalization of such text, enabling visually impaired
users to take part in this aspect of internet culture. Normalizing
informal text may also help with returning relevant search results,
summarization, and keyword, topic, sentiment and emotion detection,
all of which are currently receiving considerable attention in the
informal text domain.
Text normalization is usually the first step in TTS. Normalization
of informal text is complicated by the large number of abbreviations
used, and there is limited previous work on this problem. Aw et
al. [1] used a machine translation (MT) approach for SMS
normalization; however, such a supervised method requires a large
annotated corpus because learning is performed at the word level.
In this paper, we propose a framework for use as a starting point
for a text message normalization system. Our system uses statistical
methods to determine whether or not to remove a character based on
contextual information, and creates a list of possible abbreviations
for English words. We then reverse the mapping to create a look-up
table from text message lingo to proper English. This enables recog-
nition of abbreviations that did not appear in training. Our results
suggest this is a reasonable step toward a normalization system.
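
The generate-then-reverse idea described above can be illustrated with a minimal sketch. It is not the system described in this paper: the function `deletion_probability` is a hand-written toy heuristic standing in for the trained classifiers, and the names `generate_abbreviations`, `build_lookup` and the `top_k` parameter are hypothetical. The sketch enumerates every deletion pattern, so it is only practical for short words.

```python
from collections import defaultdict
from itertools import product

VOWELS = set("aeiou")


def deletion_probability(word: str, i: int) -> float:
    """Toy stand-in for a trained classifier (an assumption, not the paper's model)."""
    if i == 0:
        return 0.0  # assume the first character is never deleted
    return 0.6 if word[i] in VOWELS else 0.1


def generate_abbreviations(word: str, top_k: int = 5):
    """Return the top_k deletion-based abbreviations of `word` with their scores."""
    probs = [deletion_probability(word, i) for i in range(len(word))]
    scores = defaultdict(float)
    # Enumerate every delete/keep pattern; exponential in the word length.
    for mask in product([False, True], repeat=len(word)):
        if not any(mask):
            continue  # skip the pattern that deletes nothing
        score = 1.0
        for p, deleted in zip(probs, mask):
            score *= p if deleted else 1.0 - p
        if score > 0.0:
            abbr = "".join(c for c, d in zip(word, mask) if not d)
            scores[abbr] += score  # different patterns may yield the same string
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


def build_lookup(vocabulary, top_k: int = 5):
    """Reverse word -> abbreviations into abbreviation -> ranked candidate words."""
    table = defaultdict(list)
    for word in vocabulary:
        for abbr, score in generate_abbreviations(word, top_k):
            table[abbr].append((word, score))
    for candidates in table.values():
        candidates.sort(key=lambda pair: pair[1], reverse=True)
    return dict(table)


lookup = build_lookup(["later", "letter"])
print(lookup.get("ltr"))
```

Because the table is keyed by generated abbreviations rather than by abbreviations observed in training data, a form such as "ltr" can be mapped back to candidate words even if it was never seen, and the stored scores give one signal for ranking competing candidates.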
2. RELATED WORK
Text normalization is an important first step for any text-to-speech
(TTS) system and has been widely studied in many formal domains.
[2] provides a good resource for text normalization and its associated
problems. Table 1 shows the generally accepted processing methods
for unknown words with examples from both formal and informal
domains. The constant evolution of informal text presents new chal-
lenges for normalization. Spell-checking algorithms are mostly inef-
fective on this data, perhaps because they do not account for the phe-
nomena in text messages. They instead focus on single typographic
errors using edit distance, such as [3], or combine edit distance and
pronunciation modeling, such as [4].
Method     Formal Example   Texting Example
as chars   RSVP             “cu” (see you)
as word    NATO             “l8r” (later)
expand     Corp.            “prof” (professor)
combine    WinNT            “neway” (anyway)
Table 1. Methods for processing unseen tokens in normalization.
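
To illustrate why a spell checker tuned for single typographic errors struggles with this data, the following sketch computes the standard Levenshtein edit distance for two texting examples from Table 1 and for one single-character typo. The function name `levenshtein` and the typo "latr" are illustrative assumptions and do not come from the cited systems.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


for token, word in [("latr", "later"), ("l8r", "later"), ("prof", "professor")]:
    print(token, word, levenshtein(token, word))
```

The typo is a single edit away from its intended word, whereas the two texting forms are three and five edits away, which places them outside the range that a single-error correction model is designed to cover.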
The desire to translate SMS from one language to another led
Bangalore et al. [5] to use consensus translations to bootstrap a
translation system for instant messages and chat rooms, where these
abbreviations are common. Aw et al. [1] view text messaging lingo
as if it were another language with its own words and grammar, and
produce grammatically correct English sentences using a statistical
MT system. Kobus et al. [6] incorporate a second phase in the
translation model that maps characters in the texting abbreviation to
phonemes, which are viewed as the output of an automatic speech
recognition (ASR) system. They use a non-deterministic phonemic
transducer to decode the phonemes into English words.
Choudhury et al. [7] describe a supervised noisy channel model
using HMMs for text message normalization. Cook and Stevenson [8]
modified this work to create an unsupervised noisy channel approach.
They created probabilistic models for common abbreviation types and,
after combining the models, chose the English word with the highest
probability as the standard form. Yang et al. [9] work with
abbreviation generation for spoken Chinese rather than for English
text messages, but their process is quite similar to ours. They use
conditional random fields (CRFs) as a binary classifier to determine
the probability of removing a Chinese character to form an
abbreviation. They rerank the resulting abbreviations using a length
prior learned from their training data and the co-occurrence of the
original word and the generated abbreviation in web search results.
Finally, this work builds on our past work [10]. Previously, we
5364 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011