Comparative Analysis of the Performance of CRF, HMM and MaxEnt for Part-of-Speech Tagging, Chunking and Named Entity Recognition for a Morphologically Rich Language

Manish Agarwal*, Rahul Goutam*, Ashish Jain*, Sruthilaya Reddy Kesidi*, Prudhvi Kosaraju*, Shashikant Muktyar*, Bharat Ambati and Rajeev Sangal

Language Technologies Research Center
International Institute of Information Technology
Hyderabad, AP, India - 500032
{manish.agarwal, prudhvi.kosaraju, rahul.goutam, shashikant.muktyar, sruthilaya.kesidi, ambati}@research.iiit.ac.in, ashishjain@students.iiit.ac.in, sangal@iiit.ac.in

Abstract

In this paper, we present a comparative analysis between three methods for statistical part-of-speech (POS) tagging, chunking and named entity recognition (NER) for a morphologically rich language, Hindi, using a large annotated corpus. The methods explored are Conditional Random Fields (CRF), Hidden Markov Models (HMM) and Maximum Entropy Model (MaxEnt). We further propose an iterative approach as a method to improve the results. To the best of our knowledge, there is no previous work on comparative analysis of statistical POS tagging, chunking and NER in Hindi using the three methods when a large manually annotated corpus is used. The maximum POS tagging, chunking and NER accuracies for CRF, HMM and MaxEnt achieved are (94.00%, 91.70%, 56.03%), (92.96%, 89.23%, 48.21%) and (92.88%, 85.48%, 49.09%) respectively. Our work shows that CRF performs consistently better than HMM and MaxEnt for all of the three above-mentioned tasks.

1. Key Words

POS tagging, Chunking, NER and Iteration

2. Introduction

The objective of POS tagging is to assign part of speech tags to natural language text based on both its definition and its context. Chunking is the task of identifying and segmenting the text into syntactically correlated word groups.
Named Entity Recognition (NER) seeks to locate and classify entities in a text into predefined categories such as the names of persons, organizations and locations, expressions of time, quantities, etc. All three tasks are important sub-components of natural language analysis and information extraction.

Various approaches to all three tasks have been explored; they fall into two major categories, rule-based and statistical. Three of the major statistical techniques applied are Conditional Random Fields, Hidden Markov Models and Maximum Entropy Models. While considerable work has been done involving each technique for the three tasks (see related work), there is no comparative analysis of the three techniques on a large training corpus for a morphologically rich language like Hindi. Our work presents a comparative analysis of statistical POS tagging, chunking and Named Entity Recognition (NER) for Hindi using the three techniques: Conditional Random Fields (CRF, [4]), Hidden Markov Model (HMM, [3]) and Maximum Entropy Model (MaxEnt).

We further propose an iterative method which can be used to mutually improve the performance of two related tasks. In this approach, the features used in task 1 can be implicitly used in the machine learning of task 2 and vice versa, by performing iteration between the two tasks. The data sparseness problem is the major motivation behind this approach: the data is sparse in terms of the number of instances of each class of individual tags. We have explained it further in section 3.

* The ordering among the star-marked authors does not mean anything; they have contributed equally to this work.

3. Related Work

Lafferty et al. [4] proposed a conditional random field framework for POS tagging using the Penn Treebank corpus. They showed that CRF outperforms HMM and HMM outperforms MaxEnt, which was attributed as a conse-