International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-8 Issue-4, November 2019
Retrieval Number: D8067118419/2019©BEIESP
DOI:10.35940/ijrte.D8067.118419
Published By: Blue Eyes Intelligence Engineering
& Sciences Publication
Issues in Urdu-Hindi NER Output of Google and
Bing Translator: An Orthographic Perspective
Md. Tauseef Qamar, Juhi Yasmeen
Abstract: Named Entity Recognition (NER) is a sub-task of
information extraction in which names are extracted from text
and linguistic corpora. Because of its long tail, it remains a hard
problem for NLP researchers working on existing Machine
Translation (MT) systems. NER has been an area of great interest
in both MT and computational linguistics for decades, and
several tools have been designed to handle named entities in
different languages. This paper compares the end-user output of
Google and Bing Translator with special reference to Urdu-Hindi
NER, with the aim of providing insights for the development of
intelligent language tools. On the one hand, the paper deals with
the orthographic challenges pertaining to Urdu-Hindi NER in
general; on the other hand, it sheds light on transliteration
issues in particular. We also investigate personal names and
named entities of Urdu, especially ezafat constructions. Based
on the quality of the existing end-user output, the paper proposes
handling NER from a language engineering point of view.
Finally, the MT output of both Google and Bing has been
ranked on a scale of 0 to 1, where 0 is assigned to correct
output and 1 to wrong or inaccurate output.
Keywords: Named Entity Recognition, Urdu Orthographic
Challenges, Ezafat, Google and Bing NER Urdu-Hindi Output.
I. INTRODUCTION
Humans speak many different languages around the globe for
effective communication, and these languages vary greatly
from one another in a number of ways. Despite these
significant differences, certain linguistic components are
shared by all languages, and one of them is the named entity,
the object of NER. Researchers from different fields, such as
linguistics, literature, and computer science, study human
language to uncover its linguistic properties and their
underlying characteristics. Human language has likewise
drawn the attention of technology companies such as Google
and Microsoft, in addition to computer science engineers. As
a result, a number of intelligent language tools have been
designed to process human language under the umbrella of
natural language processing (NLP), among them Machine
Translation (MT), Text to Speech (TTS), and voice
assistants. Within NLP, NER has been a vast and active area
of research for the last 25 years.
NER is a sub-task of information extraction whereby names
are extracted from a text and classified; it plays an
extremely crucial role in NLP, especially in MT. MT, in turn,
is a sub-field of computational linguistics whereby the
meaning of a source language (SL) text is converted into an
equivalent meaning in the target language (TL).
Revised Manuscript Received on November 22, 2019
Md. Tauseef Qamar, Ph.D. Scholar, D/O Linguistics, AMU, Aligarh
tauseefqamar007@gmail.com
Dr. Juhi Yasmeen, Ph.D. in Linguistics, AMU, Aligarh
juhi1421@gmail.com
Therefore, it is imperative to address the crucial role of
NER in generating the end-user output. Several tools have
been designed for NER processing in resource-rich
languages, and they achieve high accuracy; one example is
the English NER extraction tool, which produces highly
accurate end-user output. However, there are still languages
whose corpora are not sufficient to process NER with high
accuracy. Urdu and Hindi are examples, especially from the
orthographic perspective (mainly the glottalic/vocalic sounds
loaned from Arabic into Urdu) and chiefly for names formed
with ezafat.
Therefore, the primary objective of this paper is to address
the need for adopting diacritics and ezafat in Hindi in line
with the system of the Urdu language. We also propose
careful attention to transliteration, which may pave the way
for better transliteration output and for the desired end-user
output of ezafat in the TL, i.e. Hindi.
Transliteration is a process whereby the phonological
characters of the SL are transferred into the equivalent
phonological characters of the TL.
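To make the role of diacritics in transliteration concrete, the following minimal Python sketch maps Urdu characters to Devanagari one by one. The mapping table, the function name transliterate, and the example word (kitab, 'book') are illustrative assumptions introduced here; they are not taken from this paper's data set and do not describe how Google or Bing work internally. Treating alif as a long-vowel sign is a simplification that only holds word-medially.

```python
# Hypothetical, deliberately tiny Urdu -> Devanagari table (an assumption
# for illustration only; a real system needs context-sensitive rules).
URDU_TO_DEVANAGARI = {
    "\u06A9": "\u0915",  # kaf  -> ka
    "\u062A": "\u0924",  # te   -> ta
    "\u0628": "\u092C",  # be   -> ba
    "\u0627": "\u093E",  # alif -> aa vowel sign (medial position only)
    "\u0650": "\u093F",  # zer (kasra diacritic) -> i vowel sign
}


def transliterate(urdu: str) -> str:
    """Map each Urdu character to Devanagari; unknown characters pass through."""
    return "".join(URDU_TO_DEVANAGARI.get(ch, ch) for ch in urdu)


# The same word written with and without the zer diacritic.
with_zer = "\u06A9\u0650\u062A\u0627\u0628"   # kaf + zer + te + alif + be
without_zer = "\u06A9\u062A\u0627\u0628"      # kaf + te + alif + be

print(transliterate(with_zer))     # short vowel i is carried into Hindi
print(transliterate(without_zer))  # short vowel is lost in the Hindi output
```

When the zer is written, the short vowel survives in the Devanagari string; when it is omitted, as is common in everyday Urdu text, a character-level mapping has no vowel to transfer. This is one simple way to see why undiacritized Urdu input can lead to inaccurate Hindi output.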
This paper sets out to identify the needs and consequent
challenges of NER transliteration in the Urdu-Hindi
scenario. Addressing these needs and challenges can help
reduce the inaccuracies in the existing end-user output of
Google and Bing Translator. Since NER is a vast area, the
present paper covers only Urdu ezafat names.
To test the collected names, we translated them on both
translation platforms (Google and Bing), which showed
noticeable inaccuracies in the existing end-user output, i.e.
Hindi. The existing end-user output is not satisfactory in
terms of either orthography (vocalic/glottalic sounds and
ezafat) or translation. Furthermore, the existing output makes
clear that the inaccuracies in the end-user output result from
the homographic challenges of Urdu in general (especially
with respect to the diacritics, ‘ehrab’) and of ezafat in
particular.
This paper is organized as follows: section
one (I) introduces NER from an
orthographic perspective in general, while section two (II)
reviews related work on NER in Urdu and Hindi in
particular, in addition to English. Sections
three (III) and four (IV) present the orthographic challenges
and ezafat (including its types) in the source language
(Urdu), respectively. Section five (V) presents
the overall picture of the existing inaccuracies in the end-
user output of Google and Bing Translator with special
reference to ezafat transliteration, while section six (VI)
focuses on handling NER from a linguistic point of
view. Sections seven (VII) and eight (VIII)
present the existing output of Google and Bing
and the discussion, respectively. Finally, section
nine (IX) presents the concluding remarks.