Named Entity Recognition in Punjabi Using Hidden Markov Model

Deepti Chopra 1, Sudha Morwal 2
Department of Computer Science, Banasthali Vidyapith, Jaipur, INDIA
deeptichopra11@yahoo.co.in, sudha_morwal@yahoo.co.in

Abstract— Named Entity Recognition (NER) is the task of discovering the Named Entities (NEs) in a document and categorizing them into Named Entity classes such as Name of Person, Location, River, Organization etc. Since a large amount of NER work has already been done for English, attention now needs to turn to NER in the Indian languages (IL). Punjabi is an Indian language and the official language of Punjab, so we have developed an NER system for Punjabi. This paper discusses NER, the approaches to NER, and the results we achieved by performing NER in Punjabi using a Hidden Markov Model (HMM).

Keywords- Accuracy; HMM; Named Entities; NER; Performance Metrics

I. INTRODUCTION

Named Entity Recognition (NER) is considered one of the key tasks in Natural Language Processing, and it forms the base for numerous applications such as Information Retrieval, Information Extraction, Question Answering, Text Summarization, Machine Translation etc. [1][2] NER involves both the identification and the classification of Named Entities (NEs) in a given document. It may be defined as the procedure of searching for the Named Entities (NEs), or proper nouns, in a corpus and then classifying them into different classes of NEs such as Name of Person, Organization, Location, City, River, Quantity, Percentage, Time etc.

Consider a sentence in Punjabi:

ਕਿਹਣ-‘ਛੱਡ/OTHER ਪਰਹ/OTHER ਚਰਨਕੋਰੇ/PER ,/OTHER ਿਕ/OTHER ਕਲਪੀ/OTHER ਜਾਨੀ/OTHER ਆਂ/OTHER ।/OTHER ਟੈਲ/OTHER ਲੁਆ/OTHER ਗੁਰਦੁਆਰੇ/LOC ।/OTHER ਿਮਸਰੀ/DRYFRUIT ,/OTHER ਕਾਜੂ/DRYFRUIT ,ਬਦਾਮ/DRYFRUIT ,/OTHER ਖਰੋਟ/DRYFRUIT ਚੱਬਣ/OTHER ਨੂੰ/OTHER ।/OTHER

In the above tagged Punjabi text, the NER system identifies each NE and its class and then allots an appropriate tag to it.
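As a minimal sketch of how such word/TAG output can be read back programmatically, the following Python snippet splits a shortened excerpt of the sentence above into (word, tag) pairs and groups the words by NE class. The function names parse_tagged and named_entities are illustrative assumptions and are not part of the system described in this paper.

```python
from collections import defaultdict


def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last '/', so a word that itself contains '/' is kept whole.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # Token carries no tag at all; skip it in this sketch.
            continue
        pairs.append((word, tag))
    return pairs


def named_entities(pairs, other_tag="OTHER"):
    """Group words by NE class, ignoring tokens marked with the non-entity tag."""
    groups = defaultdict(list)
    for word, tag in pairs:
        if tag != other_tag:
            groups[tag].append(word)
    return dict(groups)


# Shortened excerpt of the tagged sentence (hypothetical selection of tokens).
sample = "ਚਰਨਕੋਰੇ/PER ,/OTHER ਗੁਰਦੁਆਰੇ/LOC ।/OTHER ਕਾਜੂ/DRYFRUIT ,/OTHER ਖਰੋਟ/DRYFRUIT"

pairs = parse_tagged(sample)
entities = named_entities(pairs)
print(sorted(entities))
```

Splitting on the last slash is a design choice for this sketch: it keeps the tag separable even if a word contains a slash of its own.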
In this sentence, the classes of NEs are: {‘PER’, ‘LOC’, ‘DRYFRUIT’}. Here PER signifies Name of Person and LOC signifies Name of Location. In this paper, we discuss an NER system for the Punjabi language built using a Hidden Markov Model (HMM).

II. APPROACHES FOR NER

There are two methods for performing NER: the Rule Based Approach and the Machine Learning Based Approach. [3][4][5]

The Rule Based Approach can be either a List Lookup Approach or a Linguistic Approach. Both require a lot of human effort. In the List Lookup Approach, Gazetteers are first constructed, each containing the entries that belong to a Named Entity class. A search operation is then performed to find the Named Entity class under which a given word in the corpus is listed. In the Linguistic Approach, a linguist frames a set of rules to identify the NEs in a corpus and to classify them into different Named Entity classes. [1][6][7][8]

The Machine Learning Based Approach requires less human effort, so it is also known as the automated approach. It is of the following types: Hidden Markov Model (HMM), Maximum Entropy Markov Model (MEMM), Conditional Random Field (CRF), Support Vector Machine (SVM) and Decision Tree. [1][9][10] Among these, HMM is one of the easiest approaches to implement, but it requires a large amount of training.

Deepti Chopra et al./ International Journal of Computer Science & Engineering Technology (IJCSET) ISSN : 2229-3345 Vol. 3 No. 12 Dec 2012 616

There are basically three parameters employed in an HMM. These include: Start Probability,