Normalization of Chinese chat language
Kam-Fai Wong · Yunqing Xia
Published online: 29 April 2008
© Springer Science+Business Media B.V. 2008
Abstract Real-time communication platforms such as ICQ, MSN and online chat
rooms are more popular than ever on the Internet. There is, however, a real
risk that criminals and terrorists will abuse these platforms for illegal purposes. This
highlights the security significance of accurately detecting chat language and translating
it into its standard language counterpart. The language used on these platforms
differs significantly from the standard language. This language, referred to as chat
language, is comparatively informal, anomalous and dynamic. Such features render
conventional language resources, such as dictionaries, and processing tools, such as
parsers, ineffective. In this paper, we present the NIL corpus, a chat language text
collection annotated to facilitate training and testing of chat language processing
algorithms. We analyse the NIL corpus to study the linguistic characteristics and
contextual behaviour of a chat language. We first observe that the majority of chat
terms, i.e. informal words in a chat text, are formed by phonetic mapping. We then
propose the eXtended Source Channel Model (XSCM) for the normalization of the
chat language, i.e. the process of converting messages expressed in a chat language
into their standard language counterparts. Experimental results indicate that XSCM
outperforms its Source Channel Model (SCM) counterparts in chat term recognition and
normalization accuracy, and that its performance is also more consistent over time.
This is an extension of the paper presented at COLING/ACL 2006 (Xia et al. 2006b).
K.-F. Wong
Department of Systems Engineering & Engineering Management, The Chinese University
of Hong Kong, Shatin, NT, Hong Kong
e-mail: kfwong@se.cuhk.edu.hk
Y. Xia (✉)
Centre for Speech and Language Technologies, RIIT, Tsinghua University, Beijing 100084, China
e-mail: yqxia@tsinghua.edu.cn
Lang Resources & Evaluation (2008) 42:219–242
DOI 10.1007/s10579-008-9067-7