International Journal of Recent Technology and Engineering (IJRTE)
ISSN: 2277-3878, Volume-8 Issue-4, November 2019
Retrieval Number: D8067118419/2019©BEIESP
DOI: 10.35940/ijrte.D8067.118419
Published By: Blue Eyes Intelligence Engineering & Sciences Publication

Issues in Urdu-Hindi NER Output of Google and Bing Translator: An Orthographic Perspective

Md. Tauseef Qamar, Juhi Yasmeen

Abstract: Named Entity Recognition (NER) is a sub-task of information extraction in which names are extracted from text and linguistic corpora. It remains a tough nut to crack for NLP researchers working on existing Machine Translation (MT) systems because of its long tail. For decades, NER has been an area of great interest in both MT and computational linguistics, and several tools have been designed to handle named entities in different languages. This paper therefore compares the end-user output of Google and Bing translator with special reference to Urdu-Hindi NER, which should provide further insight for the development of intelligent language tools. On the one hand, the paper deals with the orthographic challenges pertaining to Urdu-Hindi NER in general; on the other hand, it sheds light on transliteration issues in particular. We have also investigated personal names and named entities of Urdu, especially ezafat constructions. The paper further proposes handling NER from a language engineering point of view, based on the quality of the existing end-user output. Finally, the MT output of both Google and Bing has been ranked on a scale of 0 to 1, where 0 is assigned to correct output and 1 to wrong or inaccurate output.

Keywords: Named Entity Recognition, Urdu Orthographic Challenges, Ezafat, Google and Bing NER Urdu-Hindi Output.

I. INTRODUCTION

Humans speak a number of different languages around the globe for effective communication, and notably these languages vary greatly from one another in a number of ways.
Apart from significant differences, there are certain linguistic components shared by all languages, and one of them is the named entity. Researchers from different fields, such as linguistics, literature, and computer science, therefore study human language to unfold its linguistic properties and their underlying specialties. Similarly, human language has drawn the attention of technology companies such as Google and Microsoft, in addition to computer science engineers. As a result, a number of intelligent language tools have been designed to process human language under the hood of natural language processing (NLP), among them Machine Translation (MT), Text to Speech (TTS), and Voice Assistants. Significantly, NER has been one of the vast and active areas of research in NLP for the last 25 years. NER is a sub-task of information extraction whereby names are extracted and classified in a text, and it plays an extremely crucial role in NLP, especially in MT. Similarly, MT is a sub-field of computational linguistics whereby the meaning of the source language (SL) is converted into an equivalent meaning in the target language (TL).

Revised Manuscript Received on November 22, 2019.
Md. Tauseef Qamar, Ph.D. Scholar, D/O Linguistics, AMU, Aligarh. tauseefqamar007@gmail.com
Dr. Juhi Yasmeen, Ph.D. in Linguistics, AMU, Aligarh. juhi1421@gmail.com

Therefore, it is imperative to deal with the crucial role of NER while generating the end-user output. Several tools have been designed for processing NER in resource-rich languages, and these produce high accuracy; the English NER extraction tool, for example, produces highly accurate end-user output. However, there are still languages whose corpora are not sufficient to process NER with high accuracy.
Urdu and Hindi are examples, especially from the orthographic perspective (mainly the glottalic/vocalic sounds loaned from Arabic into Urdu) and chiefly for names formed with ezafat. Therefore, the primary objective of this paper is to address the need for adopting diacritics and ezafat in Hindi in line with the system of the Urdu language. We also propose that careful attention be paid to transliteration, which may pave the way for better transliteration output and for generating the desired end-user output for ezafat in the TL, i.e. Hindi. Transliteration is a process in which the phonological characters of the SL are transferred into the equivalent phonological characters of the TL. Significantly, this paper proposes to identify the needs and consequent challenges of NER transliteration in the Urdu-Hindi scenario. Addressing these needs and challenges can help reduce the inaccuracies in the existing end-user output of Google and Bing translator. Since NER is a vast area, the present paper covers only Urdu ezafat names. To test the collected names, we translated them on both translation platforms (Google and Bing), which show noticeable inaccuracies in the existing end-user output, i.e. Hindi. The existing end-user output is not satisfactory from either the orthographic (vocalic/glottalic sound and ezafat) or the translation point of view. Furthermore, the existing output makes clear that the inaccuracies in the end-user output result from the homographic challenges of Urdu in general (especially from the point of view of the diacritics, ‘ehrab’) and of ezafat in particular. This paper is organized as follows: section one (I) introduces NER from the orthographic perspective in general, while section two (II) sheds light on related work on NER in Urdu and Hindi in particular, in addition to English.
Sections three (III) and four (IV) unfold the orthographic challenges and ezafat (including its types) in the source language (Urdu), respectively. Section five (V) presents the overall picture of the existing inaccuracies in the end-user output of Google and Bing translator with special reference to ezafat transliteration, while section six (VI) focuses on the handling of NER from a linguistic point of view. Sections seven (VII) and eight (VIII) present the existing output of Google and Bing and the discussion, respectively. Finally, section nine (IX) outlines the concluding remarks of this paper.
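
The 0-to-1 ranking mentioned in the abstract (0 for a correct output, 1 for a wrong or inaccurate one) could be tallied per system roughly as follows. This is only a minimal sketch under stated assumptions: the scores are treated as binary, the function name `error_rate` and the labels `system_a` and `system_b` are hypothetical, and the sample scores are made-up placeholders, not results from this paper.

```python
from typing import Dict, List


def error_rate(scores: List[int]) -> float:
    """Return the share of outputs marked wrong (1) among all scored outputs."""
    if not scores:
        raise ValueError("no scores to summarize")
    if any(s not in (0, 1) for s in scores):
        # Assumption: each output is scored 0 (correct) or 1 (wrong/inaccurate).
        raise ValueError("scores must be 0 (correct) or 1 (wrong)")
    return sum(scores) / len(scores)


# Hypothetical placeholder scores, NOT results from the paper.
placeholder_scores: Dict[str, List[int]] = {
    "system_a": [0, 1, 1, 0],
    "system_b": [1, 1, 0, 1],
}

# Map each system label to its share of wrong outputs.
summary = {name: error_rate(s) for name, s in placeholder_scores.items()}
```

Because 0 marks a correct output, a lower value in `summary` indicates a better-performing system under this scheme.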