Survey of various POS tagging techniques for Indian regional languages

Shubhangi Rathod #1, Sharvari Govilkar *2
#1,2 Department of Computer Engineering, University of Mumbai, PIIT, New Panvel, India

Abstract: Part of Speech (POS) tagging is an important tool for processing natural languages. It is one of the simplest and most stable statistical models used in many Natural Language Processing (NLP) applications. It is the process of marking up a word in a corpus as corresponding to a particular part of speech such as noun, verb, adjective, or adverb. POS tagging faces many challenges, such as foreign words, ambiguities, and ungrammatical input. This paper compares various POS tagging techniques for Indian regional languages.

Keywords: POS (Part of Speech), Rule Based Approach, Statistical Approach, Hybrid Approach.

I. INTRODUCTION

Natural language processing is the ability of a computer program to understand human language as it is spoken. It is a component of computer science, linguistics and artificial intelligence. Building an NLP application is difficult because human speech is not always precise. NLP is a process of developing a system that can read text and translate between one human language and another.

Work on Part-of-Speech (POS) tagging began in the early 1960s [2]. POS tagging is an important tool for processing natural languages. It is one of the simplest and most stable statistical models used in many NLP applications. POS tagging is an initial stage of information extraction, summarization, retrieval, machine translation, and speech conversion [2]. It is the process of assigning an appropriate part-of-speech tag, such as noun, verb, adjective, or adverb, to each word in an input sentence. In other words, it marks up a word in a corpus as corresponding to a particular part of speech, using its definition as well as its relation to neighbouring words.
POS tags are also known as word classes, morphological classes, or lexical tags; the correct grammatical tag for a word is chosen on the basis of its linguistic features [3]. The main challenge in POS tagging is resolving the ambiguity among the possible POS tags for a word.

This paper presents a detailed survey of various part of speech tagging techniques. Related work and past literature are discussed in Section 2. The basic working of a POS tagger is discussed in Section 3. The types of POS tagging techniques, and a comparison based on different criteria, are discussed in Section 4. Finally, Section 5 concludes the paper.

II. LITERATURE SURVEY

The process of POS tagging consists of these stages: tokenization, assigning a tag to each tokenized word, and searching for ambiguous words. Disambiguation is done by analyzing the linguistic features of the word, its preceding word, its following word, etc. Considerable work has already been done for foreign languages, but for South Asian languages such as Marathi and Hindi not much work has been done, because Marathi is a morphologically rich language and annotated corpora are unavailable.

In 2014, Pallavi Bagul, Archana Mishra, Prachi Mahajan, Medinee Kulkarni, Gauri Dhopavkar [1] proposed a rule based POS tagger for the Marathi language. The input sentence is sent to a tokenizer function, which splits the string into tokens; the tokens are then compared with the WordNet. The tagging module assigns a tag to each tokenized word and searches for ambiguous words and pronouns. Ambiguous words are those that can act as a noun and an adjective, or as an adjective and an adverb, depending on the context. Their ambiguity is then resolved using Marathi grammar rules. The authors used an annotated corpus from the tourism domain, and 3 grammar rules were used in the experiment to resolve these ambiguous words.

H.B. Patil, A.S. Patil, B.V. Pawar [2] proposed a Part-of-Speech Tagger for Marathi Language using Limited Training Corpora. It is also a rule based technique. The input sentence is first split into tokens. Once the tokens are generated, a stemming process is applied to remove all possible affixes and reduce each word to its stem. SRR are then used to convert the stem to the root word; the authors developed 25 SRR rules. The identified root words are given to a morphological analyzer, which carries out the analysis by dictionary lookup and morpheme analysis rules. Ambiguity is removed by the use of a rule-based model or a Hidden Markov Model. Based on the corpus, the authors identified 11 disambiguation rules that are used to remove the ambiguity. Because the stemming process removes all possible affixes, it can change the meaning of the stem word, as in (Anischit-Nischit). If the size of the corpus is increased, more rules can be discovered, which will help to reduce the error rate.

Shubhangi Rathod et al, / (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (3), 2015, 2525-2529
www.ijcsit.com
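The rule based pipeline surveyed above (tokenization, stemming, dictionary lookup, rule-based disambiguation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the English toy lexicon, the suffix list, the tag names, and the four context rules are hypothetical stand-ins and are not the Marathi resources, the SRR rules, or the disambiguation rules of the cited authors.

```python
# Toy rule-based POS tagger: tokenize -> stem -> dictionary lookup -> disambiguate.
# All data and rules below are hypothetical and chosen only for illustration.

SUFFIXES = ["ing", "ed", "s"]          # affixes stripped by the toy stemmer

LEXICON = {                            # root word -> candidate tags
    "the": ["DET"],
    "a": ["DET"],
    "fast": ["ADJ", "ADV"],            # ambiguous: adjective or adverb
    "train": ["NOUN", "VERB"],         # ambiguous: noun or verb
    "run": ["VERB", "NOUN"],
    "dog": ["NOUN"],
}


def tokenize(sentence):
    """Split a sentence into lowercase whitespace-separated tokens."""
    return sentence.lower().split()


def stem(token):
    """Strip one known suffix unless the token is already a dictionary entry."""
    if token in LEXICON:
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def disambiguate(candidates, prev_tag, next_candidates):
    """Pick one tag from several candidates using neighbouring words."""
    if len(candidates) == 1:
        return candidates[0]
    if "ADJ" in candidates and "NOUN" in next_candidates:
        return "ADJ"                   # adjective directly before a possible noun
    if "ADV" in candidates and prev_tag == "VERB":
        return "ADV"                   # adverb directly after a verb
    if "NOUN" in candidates and prev_tag in ("DET", "ADJ"):
        return "NOUN"                  # noun after a determiner or adjective
    if "VERB" in candidates and prev_tag == "NOUN":
        return "VERB"                  # verb after its subject noun
    return candidates[0]               # fallback: first listed candidate


def tag(sentence):
    """Return a list of (token, tag) pairs for the sentence."""
    tokens = tokenize(sentence)
    candidates = [LEXICON.get(stem(t), ["UNK"]) for t in tokens]
    tagged = []
    prev_tag = None
    for i, token in enumerate(tokens):
        next_candidates = candidates[i + 1] if i + 1 < len(tokens) else []
        chosen = disambiguate(candidates[i], prev_tag, next_candidates)
        tagged.append((token, chosen))
        prev_tag = chosen
    return tagged


if __name__ == "__main__":
    print(tag("the fast train runs fast"))
```

In this sketch the word "fast" receives a different tag at each position because the rules look at the preceding tag and the following word's candidates, which mirrors the context-based disambiguation described in the survey. The stemmer checks the dictionary before stripping a suffix, a simple guard against the over-stemming problem noted above.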