Towards Sindhi Named Entity Recognition: Challenges and opportunities

Wazir Ali
Department of Computer Science
Shah Abdul Latif University
Khairpur, Pakistan
aliwazirjam@gmail.com

Asadullah Kehar
Department of Computer Science
Shah Abdul Latif University
Khairpur, Pakistan
asadullah.kehar@salu.edu.pk

Hidayatullah Shaikh
Department of Computer Science
Shah Abdul Latif University
Khairpur, Pakistan
hidayat.shaikh@salu.edu.pk

Abstract: In this paper, we present the challenges and research opportunities in the field of Sindhi Named Entity Recognition (SNER). Sindhi is spoken by a large population, especially in the Sindh province of Pakistan, some states of India, and other countries. Unfortunately, the Named Entity Recognition (NER) task has never been investigated for Sindhi, owing to a number of challenges and the language's complex morphological features. The focus of this paper is therefore to discuss the difficulties and future research opportunities in NER for the Sindhi language. The study covers the importance of Sindhi, current NER methods and their applications, the challenges in developing an SNER system, and directions for future research.

Index Terms: Sindhi named entity recognition, low-resource, research challenges

I. INTRODUCTION

Named entities (NEs) are the atomic elements in a text, and NER is the extraction and classification of named entities within the text [1]. It is used to find NEs and classify them into pre-defined categories, such as names of persons, organizations, locations, designations, brands, measurements, abbreviations, dates, and times [2]. For example, consider the following sentence: “Ken Thompson and Dennis Ritchie created the UNIX operating system in 1970 at Bell Labs”. An accurate NER system will extract the following named entities:
- “Ken Thompson” and “Dennis Ritchie” as names of persons
- “UNIX” as an abbreviation
- “1970” as a date
- “Bell Labs” as an organization

NER is a fundamental process [3] for almost all Natural Language Processing (NLP) applications. It is a sub-task of information extraction [4] and is widely used in many commercial applications on the internet, such as search engines. Research on NER was initially concentrated in the Message Understanding Conferences [5]. In these conferences, some conventions were followed for NEs, which include NUMEX (numerical entities), TIMEX (temporal entities), and ENAMEX (names) [3]. After these conferences, many NER systems were developed for English and other European languages with high accuracy. The NER task is essential for information extraction [6], question answering [7], text summarization [8], information retrieval, relationship identification [9], machine translation [10], semantic web search [11], bio-informatics [12], text mining [13], and others.

Unfortunately, NER systems for South Asian languages are still in the development phase [14]. In this regard, the International Joint Conference on Natural Language Processing (IJCNLP) 2008 workshop [15] played a key role in developing NER systems for South Asian languages, especially Hindi, Oriya, Bengali, Telugu, and Urdu [16]. The tags proposed in IJCNLP-2008 are depicted in Table I with examples of Sindhi NEs.

Sindhi is one of the most ancient and historical languages spoken in Pakistan and India. To the best of our knowledge, work on the development of SNER has never been initiated, owing to the unavailability of resources and the language's complex morphological structure. The literature suggests that most South and South East Asian languages (SSEAL) [17] face largely the same challenges as Sindhi. Therefore, this paper aims to explore the challenges and research opportunities in the field of SNER.

1 https://www.mustgo.com/worldlanguages/sindhi/
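To make the task concrete, the Ken Thompson example from the beginning of this section can be sketched as a toy dictionary (gazetteer) lookup. This is only an illustration under stated assumptions, not an NER method proposed in this paper: the names `GAZETTEER`, `YEAR_PATTERN`, and `extract_entities` are hypothetical, the gazetteer is hard-coded for this one sentence, and every standalone four-digit number is assumed to be a date.

```python
import re

# Hypothetical toy gazetteer: surface form -> entity category.
# The categories follow the example sentence in the text; a real NER
# system would learn or curate such entries rather than hard-code them.
GAZETTEER = {
    "Ken Thompson": "PERSON",
    "Dennis Ritchie": "PERSON",
    "UNIX": "ABBREVIATION",
    "Bell Labs": "ORGANIZATION",
}

# Assumption: any standalone four-digit number is treated as a date (year).
YEAR_PATTERN = re.compile(r"\b\d{4}\b")


def extract_entities(text):
    """Return (start offset, surface form, category) tuples sorted by position."""
    found = []
    # Exact string matching against the gazetteer entries.
    for surface, category in GAZETTEER.items():
        for match in re.finditer(re.escape(surface), text):
            found.append((match.start(), surface, category))
    # Pattern-based matching for years.
    for match in YEAR_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "DATE"))
    return sorted(found)


sentence = ("Ken Thompson and Dennis Ritchie created the UNIX operating "
            "system in 1970 at Bell Labs")
for _, surface, category in extract_entities(sentence):
    print(f"{surface}\t{category}")
```

A lookup of this kind only recognizes strings it has already seen, which is one reason the lack of annotated resources and gazetteers matters so much for a language such as Sindhi.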
The rest of the paper is organized as follows: the importance of the Sindhi language is briefly highlighted in Section 2, and Section 3 discusses the NER approaches. The applications of NER are presented in Section 4. The challenges related to the development of an SNER system are given in Section 6, along with future research opportunities. The paper is concluded in Section 7.

II. BRIEF OVERVIEW OF SINDHI LANGUAGE

Sindhi is the official language of Sindh province. It is the native language of 40 million Sindhi people living in Pakistan. After Urdu, it is the second most spoken language in Pakistan. In India, Ulhasnagar near Mumbai is the largest Sindhi-speaking region, and the language is also spoken in other states of India, especially Rajasthan, Gujarat, and Maharashtra. It is the fourth-most-spoken language in other countries such as the United States, Australia, the United Kingdom, and Canada, where a large number of Sindhi people have emigrated. The total number of Sindhi speakers worldwide is over 42 million (see footnote 1). The current script of Sindhi is derived from the Arabic script. Like Persian, Siraiki, Panjabi, and Urdu, it is written from right to left. Sindhi is spoken in multiple dialects including Kachchi, Lari, Lasi, Thareli,