Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Bagdat Myrzakhmetov, Zhandos Yessenbayev and Aibek Makazhanov
National Laboratory Astana
53 Kabanbay batyr ave., Astana, Kazakhstan
E-mail: {bagdat.myrzakhmetov, zhyessenbayev, aibek.makazhanov}@nu.edu.kz

Abstract—We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. We therefore propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

Index Terms—user generated content, normalization, code switching, transliteration

I. INTRODUCTION

User generated content (UGC) generally refers to any type of content, i.e. photo, video, audio, or text, created by Internet users. In the computational linguistics (CL) and natural language processing (NLP) communities, UGC is often associated with user generated text, and particularly noisy text, such as tweets and user comments. UGC is notoriously difficult to process due to the prompt introduction of neologisms, e.g. esketit (stands for let's get it, pronounced [ɛɕˈkerɛ]), and peculiar spelling, e.g. b4 (stands for before). Moreover, speakers of more than one language tend to mix them in UGC (a phenomenon commonly referred to as code-switching) and/or use transliteration (spelling in non-national alphabets). All of this increases lexical variety, thereby aggravating the most prominent problems of CL/NLP, such as out-of-vocabulary lexica and data sparseness.
It has been repeatedly shown that NLP methods struggle when applied to UGC directly [1]-[4] and that certain preprocessing is required for them to work properly. Such preprocessing is commonly referred to as lexical normalization, or simply normalization. To this end, research on UGC normalization is of utmost interest to the NLP community, and over the past three years three shared task competitions have been held at consecutive WNUT workshops [5]-[7].

The Kazakhstani segment of the Internet is not exempt from noisy UGC, and the following cases are the usual suspects in wreaking the "spelling mayhem":

- spontaneous transliteration – switching alphabets, respecting no particular rules or standards, e.g. the Kazakh word "біз" (we as a pronoun; awl as a noun) can be spelled in three additional ways: "биз", "быз", and "biz";
- use of homoglyphs – interchangeable use of identical or similar-looking Latin and Cyrillic letters, e.g. the Cyrillic letters "е" (U+0435), "с" (U+0441), "і" (U+0456), and "р" (U+0440) in the Kazakh word «есірткі» (drugs) can be replaced with the Latin homoglyphs "e" (U+0065), "c" (U+0063), "i" (U+0069), and "p" (U+0070), which, although they appear identical, have different Unicode values;
- code switching – use of Russian words and expressions in Kazakh text and vice versa;
- word transformations – excessive duplication of letters, e.g. "керемееет" instead of "керемет" (great), or segmentation of words, e.g. "к е р е м е т" or "к-е-р-е-м-е-т".

In this work we propose an approach for the initial normalization of UGC. Here an important distinction must be drawn. Unlike lexical normalization [1], initial normalization does not attempt to recover the standard spelling of ill-formed words; in fact, it does not even attempt to detect them. All that matters at this point is to provide an intermediate representation of the input UGC that will not necessarily match its lexically normalized version, but will be less sparse.
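To make the homoglyph and word-transformation cases concrete, the following is a minimal Python sketch; the mapping table, regular expressions, and function names are illustrative assumptions on our part, not the paper's actual implementation:

```python
import re

# Latin-to-Cyrillic homoglyph map: a small illustrative subset covering
# the letter pairs cited in the text plus a few other common look-alikes.
HOMOGLYPHS = str.maketrans({
    "e": "е", "c": "с", "i": "і", "p": "р",
    "a": "а", "o": "о", "x": "х", "y": "у",
})

def resolve_homoglyphs(text: str) -> str:
    """Replace Latin homoglyphs with their Cyrillic counterparts."""
    return text.translate(HOMOGLYPHS)

def collapse_duplicates(word: str) -> str:
    """Collapse runs of a repeated letter, e.g. 'керемееет' -> 'керемет'.
    NB: this naive rule also collapses legitimate double letters; a real
    system would likely be more conservative."""
    return re.sub(r"(.)\1+", r"\1", word)

def join_segmented(text: str) -> str:
    """Rejoin words segmented with spaces or hyphens between single letters,
    e.g. 'к е р е м е т' or 'к-е-р-е-м-е-т' -> 'керемет'. May overreach on
    genuine runs of single-letter words."""
    return re.sub(
        r"\b(?:\w[ -]){2,}\w\b",
        lambda m: re.sub(r"[ -]", "", m.group()),
        text,
    )
```

For instance, `resolve_homoglyphs("ecipткi")` maps the mixed-script rendering of «есірткі» back to all-Cyrillic spelling, and `join_segmented` undoes both segmentation variants shown above.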
Thus, we aim to improve the performance of downstream applications by reducing the vocabulary size (effectively, the parameter space) and the OOV rate. To this end, initial normalization does two things: (i) converts the input into a common script (a Russian Cyrillic based alphabet with some omissions); (ii) recovers word transformations and performs various minor replacements. The difference between lexical and initial normalization is illustrated by the example in Table I. Notice how, for a given Kazakh text, lexical normalization increases and initial normalization decreases the number of unique characters.

Our approach amounts to the successive application of three straightforward procedures: (i) homoglyph resolution, (ii) common script transliteration, and (iii) replacement and transformation. To assess the extent of data sparseness reduction, we calculate basic statistics, such as vocabulary size, token-type ratio, and OOV rate, for the raw and normalized data and show that our approach substantially reduces lexical variety. In addition, we evaluate our approach extrinsically on the language identification and sentiment analysis tasks. In both cases we report improvements in terms of per-language and overall accuracy.
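The sparseness statistics named above (vocabulary size, token-type ratio, OOV rate) can be computed as follows; this is a sketch under our own assumptions about tokenization and the reference lexicon, not the paper's evaluation code:

```python
from collections import Counter

def corpus_stats(tokens: list[str], lexicon: set[str]) -> dict:
    """Basic sparseness statistics for a tokenized corpus:
    vocabulary size (number of types), token-type ratio
    (types per token), and OOV rate relative to a reference lexicon."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_oov = sum(c for word, c in counts.items() if word not in lexicon)
    return {
        "vocab_size": n_types,
        "token_type_ratio": n_types / n_tokens,
        "oov_rate": n_oov / n_tokens,
    }
```

Running `corpus_stats` on the raw text and again on its initially normalized version would then quantify the reduction in lexical variety: e.g. if the spelling variants "биз" and "biz" are both normalized to "біз", vocabulary size, token-type ratio, and OOV rate all drop.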