Information Processing and Management — journal homepage: www.elsevier.com/locate/infoproman

An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis

Khawar Mehmood a, Daryl Essam a, Kamran Shafi a, Muhammad Kamran Malik b

a School of Engineering and Information Technology (SEIT), University of New South Wales Canberra, Australia
b Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan

Keywords: Machine learning; Natural language processing; Pattern recognition; Sentiment analysis

Abstract

Text normalization is the task of transforming lexically variant words into their canonical forms. The importance of text normalization becomes apparent when developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN utilizes the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words into their canonical forms. It consists of three interlinked modules: a transliteration based encoder, a filter module and a hash code ranker. The encoder generates all possible hash-codes for a single Roman Hindi/Urdu word; the filter module discards the irrelevant codes; and the ranker orders the filtered hash-codes by their relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate from the baseline.
TERUN was then enhanced from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. In addition, different phonetic algorithms and TERUN were intrinsically evaluated on a dataset of 20,000 lexically variant words. The results clearly showed the superiority of TERUN over well-known phonetic algorithms.

https://doi.org/10.1016/j.ipm.2020.102368
Received 22 January 2020; Received in revised form 7 August 2020; Accepted 7 August 2020
Corresponding author. E-mail address: k.mehmood@unsw.edu.au (K. Mehmood).

1. Introduction

The advent of low-cost, high-speed internet and handheld devices has encouraged users to publish a variety of content on social networks and weblogs. This content, which includes feedback on products and instant messages, constitutes an invaluable source of information. To get a glimpse of the immensity of this data, WhatsApp alone generates more than 65 billion messages a day,[1] whereas Twitter generates approximately 9,000 tweets a second.[2] While using these media, people deviate from the standard dialect and arguably follow idiolectic grammar rules (Blevins et al., 2016). These idiolects result in non-standard words (Baldwin et al., 2015), phonological variations (Eisenstein, 2013) and non-lexical sentiment intensifiers (Eryiğit & Torunoğlu-Selamet, 2017). These challenges

[1] https://www.connectivasystems.com/whatsapp-facts-stats-2020/#WhatsApp_Facts_and_Stats_about_Usage_in_2020 - Last visited on 24-4-2020.
[2] https://www.internetlivestats.com/one-second/#tweets-band - Last visited on 24-4-2020.
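To make the three-module pipeline summarized in the abstract concrete, the following is a minimal sketch of an encode→filter→rank flow. It is illustrative only: the character-to-code mapping, the reference lexicon, and all function names are assumptions for exposition, not the paper's actual TERUN implementation.

```python
# Hedged sketch of a TERUN-style pipeline. The mapping, lexicon and
# frequencies below are toy assumptions, not the published system.
from itertools import product

# Hypothetical many-to-many Roman-letter-to-code mapping (assumption):
# one letter may yield several candidate codes, including deletion ("").
CHAR_CODES = {
    "k": ["K"], "q": ["K"], "c": ["K", "S"],
    "i": ["I", ""], "e": ["I"], "a": ["A", ""],
    "y": ["I", "Y"], "r": ["R"], "n": ["N"], "h": ["H", ""],
}

def encode(word):
    """Encoder module: generate all candidate hash-codes for one word."""
    options = [CHAR_CODES.get(ch, [ch.upper()]) for ch in word.lower()]
    return {"".join(combo) for combo in product(*options)}

def filter_codes(codes, valid_codes):
    """Filter module: keep only codes attested in a reference corpus."""
    return codes & valid_codes

def rank(codes, freq):
    """Ranker module: order surviving codes by corpus frequency."""
    return sorted(codes, key=lambda c: freq.get(c, 0), reverse=True)

# Toy corpus statistics (assumed): hash-code -> frequency.
FREQ = {"KIA": 7, "KA": 2}

# Two spelling variants of the same word converge on one top-ranked code.
candidates = encode("kya") | encode("kia")
best = rank(filter_codes(candidates, set(FREQ)), FREQ)
```

Here the lexical variants "kya" and "kia" both generate the code "KIA", so after filtering and ranking they normalize to the same canonical form, which is the behaviour the paper evaluates intrinsically and via classification accuracy.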