VOL. 11, NO. 13, JULY 2016 ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
© 2006-2016 Asian Research Publishing Network (ARPN). All rights reserved.
www.arpnjournals.com

EXPERIMENTAL ANALYSIS OF MALAYALAM POS TAGGER USING EPIC FRAMEWORK IN SCALA

Sachin Kumar S., M. Anand Kumar and K. P. Soman
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India
E-Mail: sachinnme@gmail.com

ABSTRACT
In Natural Language Processing (NLP), one of the well-studied problems under constant exploration is part-of-speech tagging, also called POS tagging or grammatical tagging. The task is to assign labels or syntactic categories such as noun, verb, adjective, adverb, preposition etc. to the words in a sentence or in an un-annotated corpus. This paper presents a simple machine learning based experimental study of POS tagging using a new structured prediction framework known as EPIC, developed in the Scala programming language. This paper is the first of its kind to perform POS tagging for an Indian language using the EPIC framework. The corpus contains labelled Malayalam sentences from the health, tourism and general (news, stories) domains. The EPIC framework uses conditional random fields (CRF) to build the tagging models. The framework provides several parameters that can be adjusted to improve accuracy and thereby arrive at a better POS tagger model. The overall accuracy was calculated separately for each domain; with a small amount of tagged data, the maximum accuracies obtained were 85.48%, 85.39% and 87.35% for the health, tourism and general domains, respectively.

Keywords: parts-of-speech tagging (POS), conditional random field (CRF), AMRITA tag set, EPIC, Malayalam language.

1. INTRODUCTION
Part-of-speech (POS) tagging is a well-known problem under constant research in language processing [1]. A POS tagger is an essential tool for parsing, information retrieval, word sense disambiguation, correct lemmatization etc.
POS tagging is the process of assigning each word in a sentence a tag that indicates its syntactic category in context. Equivalently, it is a method of categorizing the words of a language according to their morphological and syntactic features. Common tag categories are noun, verb, adverb, adjective, conjunction etc. POS tagging plays an important role in applications such as machine translation, language modeling, word sense disambiguation, question and answer analysis, dialogue tagging, social media data tagging and information retrieval. For example, the Malayalam word ഇയ can be either a verb or a noun, as it has two meanings - ഇയඐക, കരക and ഇയഽ, ഒയഴാം. The task of the POS tagger is therefore to disambiguate and correctly identify the grammatical category.

In the Indian language scenario, POS taggers have been developed for the Dravidian languages (Kannada, Malayalam, Tamil and Telugu), Hindi, Punjabi, Odia, Marathi and Bengali. Each language has its own tag set, prepared by different organizations or research groups, which contains main tags and sub-tags that refer to its morpho-syntactic features [2]-[15]. The Bureau of Indian Standards (BIS) POS tag set aims to provide a common tag set for Indian languages. It was prepared by the POS Tag Standardization Committee, Department of Information Technology, New Delhi.

Several methods have been applied to the POS tagging task: hidden Markov models [16], [17], [18], memory-based learning [19], maximum entropy modeling [20], transformation-based learning [21], decision trees [22], [23], support vector machines [24], [13], rule-based approaches [25], disambiguation rules [26], [27], and hybrid approaches that combine stochastic methods and rules [28].

Indian languages are morphologically rich, which poses a major challenge in disambiguating words; a larger number of tags is therefore required to deal with the ambiguities.
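The context dependence described above can be illustrated with a minimal sketch. It uses English words rather than Malayalam, and the lexicon, the tag names (`DET`, `PRON`, `NOUN`, `VERB`) and the function `tag_sentence` are all hypothetical choices made for illustration. The single hand-written rule stands in for the statistical model; it is not the CRF-based tagger studied in this paper.

```python
from typing import List, Optional, Tuple

# Hypothetical toy lexicon: each word maps to the tags it may take.
# "book" is ambiguous between a noun and a verb reading.
LEXICON = {
    "i": ["PRON"],
    "a": ["DET"],
    "the": ["DET"],
    "book": ["NOUN", "VERB"],
    "flight": ["NOUN"],
    "read": ["VERB"],
}


def tag_sentence(words: List[str]) -> List[Tuple[str, str]]:
    """Assign one tag per word, resolving ambiguity from the previous tag."""
    tagged: List[Tuple[str, str]] = []
    prev_tag: Optional[str] = None
    for word in words:
        # Words missing from the lexicon default to NOUN (an assumption).
        candidates = LEXICON.get(word.lower(), ["NOUN"])
        if len(candidates) == 1:
            tag = candidates[0]
        elif prev_tag == "DET" and "NOUN" in candidates:
            # After a determiner, prefer the noun reading.
            tag = "NOUN"
        elif "VERB" in candidates:
            # Otherwise prefer the verb reading.
            tag = "VERB"
        else:
            tag = candidates[0]
        tagged.append((word, tag))
        prev_tag = tag
    return tagged
```

With this sketch, the same word receives a different tag in `["Book", "a", "flight"]` than in `["I", "read", "the", "book"]`, because the preceding tag differs. A learned tagger replaces the hand-written rule with weights estimated from annotated sentences.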
The morphological richness of the language makes it difficult to prepare complex rules for POS tagging. Machine learning approaches use the linguistically motivated data associated with each language. Because Indian languages are highly inflectional, the methods and techniques used for one language may not be useful for another. Several articles have proposed POS tagging for morphologically rich languages using stochastic methods together with specific hand-crafted rules developed with the help of linguists [29], [30], [31], [32]. This approach requires an expert linguist's opinion to create accurate rules, and a large corpus for the stochastic methods to be effective. Several approaches to POS tagging for the Malayalam language have also been reported [13], [45].

This paper presents a POS tagger for the Malayalam language using the EPIC framework in the Scala language. Here, the POS tagging task is defined as a sequence labeling problem. This is the first attempt to explore the EPIC framework for POS tagging in Indian languages.

This paper is organized as follows. Section 'Tag set' gives an overview of the AMRITA tag set. Section 'Conditional Random Fields' gives a brief introduction to conditional random fields. Section 'EPIC framework' gives an overview of the EPIC framework. In section 'Experimental Result', the experiments and the obtained results are discussed.

1.1 Tag set
A tag set represents the tag categories that can be used to tag each word based on the context. Researchers working on Indian languages use different tag sets, such as AUKBC, Vasuranganathan tag set, CIIL Tag set,