Research Article
Mixed Script Identification Using Automated DNN Hyperparameter Optimization
Muhammad Yasir,1 Li Chen,1 Amna Khatoon,2 Muhammad Amir Malik,3 and Fazeel Abid4
1 School of Information Science and Technology, Northwest University, Xi’an, Shaanxi, China
2 Department of Information Engineering, Chang’an University, Xi’an, Shaanxi, China
3 Department of Computer Science, Islamic International University, Islamabad, Pakistan
4 Department of Information System, University of Management and Technology, Lahore, Pakistan
Correspondence should be addressed to Li Chen; chenli@nwu.edu.cn
Received 3 October 2021; Revised 30 October 2021; Accepted 5 November 2021; Published 10 December 2021
Academic Editor: Ahmed Mostafa Khalil
Copyright © 2021 Muhammad Yasir et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Mixed script identification is a hindrance for automated natural language processing (NLP) systems. Mixing cursive scripts of different
languages is a challenge because NLP methods such as POS tagging and word sense disambiguation suffer from noisy text. This study
tackles the challenge of mixed script identification for a mixed-code dataset consisting of Roman Urdu, Hindi, Saraiki, Bengali, and
English. The language identification model is trained using word vectorization and RNN variants. Moreover, through experimental
investigation, different architectures are optimized for the task: Long Short-Term Memory (LSTM),
Bidirectional LSTM, Gated Recurrent Unit (GRU), and Bidirectional Gated Recurrent Unit (Bi-GRU). Experimentation achieved
the highest accuracy of 90.17 for Bi-GRU, applying learned word class features along with GloVe embeddings. Moreover, this
study addresses issues related to multilingual environments, such as Roman words merged with English characters, generative
spellings, and phonetic typing.
1. Introduction
Code-mixing is defined as “the embedding of linguistic
components such as phrases, words, and lexemes from one
language into an expression from another language.” Code-mixing
refers to the use of linguistic units (words, phrases,
clauses) from different languages at the sentence level. Two or
more languages may be combined to form an intelligible
new language. This hybrid language is known as a fused lect.
“Code-switching” is considered an unregulated choice by
linguists, and is also known as “language mixing,” or as
“fused lects” in cases where grammar is rigid.
Where code-switching between two or more languages is
prevalent, terms from both languages may become common
in sentences. Rather than occurring at semantically or
sociolinguistically significant points, this code-mixing has
no particular value in the immediate context. Fused lects
are completely grammaticalized and, owing to their semantics and
pragmatics, allow for less variety than a mixed language.
The grammar of the fused lect determines which
source-language parts may be included in the fusion. Code-mixing is
commonly observed in informal settings such as social media.
With the abundance of social media platforms available for
people to communicate, the amount of code-mixed data
available to us is tremendous. The content shared in social
media discussions is frequently mixed with stylistic and
misspelled versions of original words. POS tagging and
named entity identification suffer due to the noisy input. In
addition, social media users often utilize mixed scripts of
Roman text.
The use of Roman script leads to the generation of an informal
mixed language, an amalgamation of two or more languages.
This phenomenon is observed on social media
specifically. Multilingual users are using the Roman
Hindawi
Computational Intelligence and Neuroscience
Volume 2021, Article ID 8415333, 13 pages
https://doi.org/10.1155/2021/8415333