IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 02 Special Issue: 02 | Dec-2013, Available @ http://www.ijret.org

A STUDY ON THE APPROACHES OF DEVELOPING A NAMED ENTITY RECOGNITION TOOL

Hridoy Jyoti Mahanta 1
1 Department of Information Technology, Assam University

Abstract
Named entity recognition (NER) is of vital importance to information extraction in natural language processing. Identifying the named entities in a piece of text and tagging them with the proper classes helps recover much of the information embedded in that text. This paper briefly describes the various approaches to developing an NER tool. It also gives an overview of the models and learning methodologies used in the statistical approach, and states the factors that need to be considered in developing such a tool.

Keywords: Named Entity Recognition, Information Extraction, Corpus, Enamex.

1. INTRODUCTION
Natural Language Processing (NLP) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications [see Encyclopaedia of Library and Information Science, Liddy, E. D.]. NLP research spans many areas of language processing, including information extraction, information retrieval, machine translation, speech recognition and question answering. With the growing interest of scholars, linguists and scientists, many new methodologies for language processing have emerged.
Named Entity Recognition is a subtask of information extraction that seeks to locate and classify the proper names in a text. NER is an indispensable part of many Natural Language Processing applications, such as question answering, machine translation, information extraction and information retrieval.

2. ABOUT NER
As information extraction received growing attention, it became necessary to identify information units such as names of persons, places and organizations, temporal expressions, etc. It became evident that, to achieve good performance in information extraction, a system needs to be able to recognize names. Identifying these entities therefore became a prime task of IE, and the task came to be known as “named entity recognition”. Much research has since been carried out on NER, using both knowledge engineering and machine learning approaches.

In its canonical form, the input of an NER system is a text and the output is information on the boundaries and types of the NEs found in the text. For instance, consider the following text:

“Richard Marx and Julia Smith work for Microsoft Corp. and are currently settled in Boston.”

Fed to an NER system, this text produces output of the following form:

“Richard Marx [PERSON] and Julia Smith [PERSON] work for Microsoft Corp. [ORGANIZATION] and are currently settled in Boston [LOCATION].”

3. RELATED FACTORS
As a large number of papers on NER were presented at the Message Understanding Conference (MUC), some basic factors emerged that need prime emphasis when working in this field.

3.1 Language Factor
A large amount of research has been done on English, and most domains in English have been explored. This, however, has confined the work to that particular language. Language independence and multilingual support are the prime problems in this field. Languages such as German, Spanish and Dutch have each been studied in their own right. Chinese, Japanese, French, Italian and Greek have been studied in an abundant literature.
Surveys on Bulgarian, Hindi, Danish, Korean and Turkish are in progress. Even Arabic has started receiving attention. But this work remains confined within the boundaries of each individual language. Developing a multilingual NER system is a prime concern in the current scenario. [1]

3.2 Domain Factor
Initial work on NER ignored the two main factors of a language, firstly the type of text or textual genre like informal,