Normalization of Chinese chat language

Kam-Fai Wong · Yunqing Xia

Published online: 29 April 2008
© Springer Science+Business Media B.V. 2008

Abstract Real-time communication platforms such as ICQ, MSN and online chat rooms are more popular than ever on the Internet. They also carry real risks, however, because criminals and terrorists can use them to perpetrate illegal activities. This highlights the security significance of accurately detecting chat language and translating it into its standard language counterpart. The language used on these platforms differs significantly from the standard language. This language, referred to as chat language, is comparatively informal, anomalous and dynamic. Such features render conventional language resources such as dictionaries, and processing tools such as parsers, ineffective. In this paper, we present the NIL corpus, a chat language text collection annotated to facilitate training and testing of chat language processing algorithms. We analyse the NIL corpus to study the linguistic characteristics and contextual behaviour of a chat language. We first observe that the majority of chat terms, i.e. informal words in a chat text, are formed by phonetic mapping. We then propose the eXtended Source Channel Model (XSCM) for the normalization of the chat language, i.e. the process of converting messages expressed in a chat language into their standard language counterparts. Experimental results indicate that the performance of XSCM in terms of chat term recognition and normalization accuracy is superior to that of its Source Channel Model (SCM) counterparts, and is also more consistent over time.

This is an extension of the paper presented at COLING/ACL 2006 (Xia et al. 2006b).

K.-F. Wong
Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Shatin, NT, Hong Kong
e-mail: kfwong@se.cuhk.edu.hk

Y. Xia (✉)
Centre for Speech and Language Technologies, RIIT, Tsinghua University, Beijing 100084, China
e-mail: yqxia@tsinghua.edu.cn

Lang Resources & Evaluation (2008) 42:219–242
DOI 10.1007/s10579-008-9067-7