TOWARD TEXT MESSAGE NORMALIZATION: MODELING ABBREVIATION GENERATION

Deana Pennell and Yang Liu
Computer Science Department
The University of Texas at Dallas
{deana, yangl}@hlt.utdallas.edu

ABSTRACT

This paper describes a text normalization system for deletion-based abbreviations in informal text. We propose using statistical classifiers to learn the probability of deleting a given character using features based on character context, position in the word and containing syllable, and function within the word. To ensure that our system is robust to different and previously unseen abbreviations for a word, we generate multiple abbreviation hypotheses for each word using the predictions from the classifiers. We then reverse the mappings to enable recovery of English words from the abbreviations. Different knowledge sources are used to disambiguate word candidates: abbreviation likelihood, length, and language model scores. Our results show that this approach is feasible and warrants further exploration.

Index Terms: noisy text processing, Twitter, text normalization, abbreviation modeling

1. INTRODUCTION

Text messaging (SMS) is a rapidly growing form of alternative communication for cell phones. This popularity has raised safety concerns, leading many US states to pass laws prohibiting texting while driving. The technology is also difficult to use for people with visual impairments or physical handicaps. We believe a text-to-speech (TTS) system for cell phones can reduce these problems, promoting safe travel and ease of use for all users.

Text message lingo is also similar to the chatspeak that is prolific on forums, blogs and chatrooms. Screen readers will also benefit from such technology, enabling visually impaired users to take part in this aspect of internet culture.
Normalizing informal text may help with returning relevant search results, summarization, and keyword, topic, sentiment and emotion detection, all of which are currently receiving a lot of attention in the informal text domain.

Text normalization is usually the first step for TTS. Normalization of informal text is complicated by the large number of abbreviations used. There is limited previous work on this problem. Aw et al. [1] used a machine translation (MT) approach for SMS normalization; however, such a supervised method requires a large annotated corpus because learning is performed at the word level.

In this paper, we propose a framework for use as a starting point for a text message normalization system. Our system uses statistical methods to determine whether or not to remove a character based on contextual information, and creates a list of possible abbreviations for English words. We then reverse the mapping to create a look-up table from text message lingo to proper English. This enables recognition of abbreviations that did not appear in training. Our results suggest this is a reasonable step toward a normalization system.

2. RELATED WORK

Text normalization is an important first step for any text-to-speech (TTS) system and has been widely studied in many formal domains. [2] provides a good resource for text normalization and its associated problems. Table 1 shows the generally accepted processing methods for unknown words with examples from both formal and informal domains. The constant evolution of informal text presents new challenges for normalization. Spell-checking algorithms are mostly ineffective on this data, perhaps because they do not account for the phenomena in text messages. They instead focus on single typographic errors using edit distance, such as [3], or combine edit distance and pronunciation modeling, such as [4].

Method      Formal Example    Texting Example
as chars    RSVP              “cu” (see you)
as word     NATO              “l8r” (later)
expand      Corp.             “prof” (professor)
combine     WinNT             “neway” (anyway)

Table 1. Methods for processing unseen tokens in normalization.

The desire to translate SMS from one language to another led Bangalore et al. [5] to use consensus translations to bootstrap a translation system for instant messages and chat rooms, where such abbreviations are common. Aw et al. [1] view text messaging lingo as if it were another language with its own words and grammar, and produce grammatically correct English sentences using a statistical MT system. Kobus et al. [6] incorporate a second phase in the translation model that maps characters in the texting abbreviation to phonemes, which are viewed as the output of an automatic speech recognition (ASR) system. They use a non-deterministic phonemic transducer to decode the phonemes into English words.

The work of Choudhury et al. [7] describes a supervised noisy channel model using HMMs for text message normalization. Cook and Stevenson [8] modified this work to create an unsupervised noisy channel approach. They created probabilistic models for common abbreviation types and, after combining the models, chose the English word with the highest probability as the standard form.

Yang et al. [9] work with abbreviation generation for spoken Chinese rather than for English text messages, but their process is quite similar to ours. They use conditional random fields (CRFs) as a binary classifier to determine the probability of removing a Chinese character to form an abbreviation. They rerank the resulting abbreviations using a length prior learned from their training data and the co-occurrence of the original word and the generated abbreviation in web search results.

5364 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011

Finally, this work builds on our past work [10]. Previously, we