Information Processing and Management
An unsupervised lexical normalization for Roman Hindi and Urdu sentiment analysis
Khawar Mehmood ⁎,a, Daryl Essam a, Kamran Shafi a, Muhammad Kamran Malik b
a School of Engineering and Information Technology (SEIT), University of New South Wales Canberra, Australia
b Punjab University College of Information Technology (PUCIT), University of the Punjab, Lahore, Pakistan
ARTICLE INFO
Keywords:
Machine learning
Natural language processing
Pattern recognition
Sentiment analysis
ABSTRACT
Text normalization is the task of transforming lexically variant words to their canonical forms. The importance of text normalization becomes apparent when developing natural language processing applications. This paper proposes a novel technique called Transliteration based Encoding for Roman Hindi/Urdu text Normalization (TERUN). TERUN utilizes the linguistic aspects of Roman Hindi/Urdu to transform lexically variant words to their canonical forms. It consists of three interlinked modules: a transliteration based encoder, a filter module and a hash code ranker. The encoder generates all possible hash codes for a single Roman Hindi/Urdu word. The next component filters out the irrelevant codes, while the third module ranks the filtered hash codes based on their relevance. The aim of this study is not only to normalize the text but also to examine the impact of normalization on text classification. Hence, baseline classification accuracies were computed on a dataset of 11,000 non-standardized Roman Hindi/Urdu sentiment analysis reviews using different machine learning algorithms. The dataset was then standardized using TERUN and other established phonetic algorithms, and the classification accuracies were recomputed. The cross-scheme comparison showed that TERUN outperformed all the phonetic algorithms and significantly reduced the error rate from the baseline. TERUN was then enhanced from a corpus-specific to a corpus-independent text normalization technique. To this end, a parallel corpus of 50,000 Urdu and Roman Hindi/Urdu words was manually tagged using a set of comprehensive annotation guidelines. In addition, different phonetic algorithms and TERUN were intrinsically evaluated using a dataset of 20,000 lexically variant words. The results clearly showed the superiority of TERUN over well-known phonetic algorithms.
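The abstract describes TERUN as three interlinked modules: an encoder that generates candidate hash codes for a word, a filter that discards irrelevant codes, and a ranker that orders the survivors by relevance. As a rough illustration of that pipeline shape only, the sketch below substitutes toy spelling-collapse rules for the paper's transliteration-based encoding; the rules, lexicon, and frequency table are all hypothetical.

```python
# Illustrative sketch of a three-stage encode -> filter -> rank pipeline,
# mirroring the module structure described in the abstract. The encoding
# rules below are placeholders, not TERUN's transliteration-based codes.

def encode(word):
    """Stage 1: generate candidate codes for a Roman Hindi/Urdu word.
    Placeholder rules collapse two common spelling variations."""
    variants = {word, word.replace("aa", "a"), word.replace("ee", "i")}
    return sorted(variants)

def filter_codes(codes, lexicon):
    """Stage 2: keep only codes that match a reference lexicon."""
    return [c for c in codes if c in lexicon]

def rank(codes, freq):
    """Stage 3: order the surviving codes by relevance, approximated
    here by corpus frequency."""
    return sorted(codes, key=lambda c: freq.get(c, 0), reverse=True)

def normalize(word, lexicon, freq):
    """Full pipeline: return the top-ranked canonical form, or the
    original word when no candidate survives filtering."""
    candidates = filter_codes(encode(word), lexicon)
    return rank(candidates, freq)[0] if candidates else word

# 'pyaar' is a common lexical variant of the canonical 'pyar' ("love").
print(normalize("pyaar", {"pyar"}, {"pyar": 10}))  # -> pyar
```

In the real system each stage would be far richer (the encoder alone enumerates all phonetically plausible hash codes), but the division of labor — overgenerate, prune, then rank — is the same.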
1. Introduction
The advent of low-cost, high-speed internet and handheld devices has encouraged users to publish a variety of content on social networks and weblogs. This content includes feedback on products and instant messages, and constitutes an invaluable source of information. To get a glimpse of the immensity of this data, WhatsApp alone generates more than 65 billion messages a day,1 whereas Twitter generates approximately 9,000 tweets a second.2 While using these media, people deviate from the standard dialect and arguably follow idiolectic grammar rules (Blevins et al., 2016). These idiolects result in non-standard words (Baldwin et al., 2015), phonological variations (Eisenstein, 2013) and non-lexical sentiment intensifiers (Eryiğit & Torunoglu-Selamet, 2017). These challenges
https://doi.org/10.1016/j.ipm.2020.102368
Received 22 January 2020; Received in revised form 7 August 2020; Accepted 7 August 2020
⁎ Corresponding author.
E-mail address: k.mehmood@unsw.edu.au (K. Mehmood).
1 https://www.connectivasystems.com/whatsapp-facts-stats-2020/#WhatsApp_Facts_and_Stats_about_Usage_in_2020 - Last visited on 24-4-2020.
2 https://www.internetlivestats.com/one-second/#tweets-band - Last visited on 24-4-2020.
0306-4573/ © 2020 Elsevier Ltd. All rights reserved.