VOL. 11, NO. 13, JULY 2016 ISSN 1819-6608
ARPN Journal of Engineering and Applied Sciences
© 2006-2016 Asian Research Publishing Network (ARPN). All rights reserved.
www.arpnjournals.com
EXPERIMENTAL ANALYSIS OF MALAYALAM POS TAGGER USING
EPIC FRAMEWORK IN SCALA
Sachin Kumar S., M. Anand Kumar and K. P. Soman
Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India
E-Mail: sachinnme@gmail.com
ABSTRACT
In Natural Language Processing (NLP), part-of-speech (POS) tagging, also called grammatical tagging, is a well-studied
problem that remains under active exploration. The task is to assign labels or syntactic categories such as noun,
verb, adjective, adverb and preposition to the words in a sentence or in an un-annotated corpus. This paper presents a
simple machine-learning-based experimental study of POS tagging using a new structured prediction framework known as
EPIC, developed in the Scala programming language. This paper is the first of its kind to perform POS tagging for an
Indian language using the EPIC framework. The corpus contains labelled Malayalam sentences in the health, tourism and
general (news, stories) domains. The EPIC framework uses conditional random fields (CRF) to build tagging models.
The framework provides several parameters that can be adjusted to improve accuracy and thereby obtain a better POS
tagger model. The overall accuracy was calculated separately for each domain, and maximum accuracies of 85.48%,
85.39% and 87.35% were obtained for small tagged data sets in the health, tourism and general domains, respectively.
Keywords: parts-of-speech tagging (POS), conditional random field (CRF), AMRITA tag set, EPIC, Malayalam language.
1. INTRODUCTION
Part-of-speech (POS) tagging is a well-known
problem under constant research in language
processing [1]. A POS tagger is an essential tool for
parsing, information retrieval, word sense disambiguation,
correct lemmatization etc. POS tagging is the process by
which each word in a sentence is assigned a tag that
shows its syntactic category depending on the context.
Equivalently, it is a method by which the words of a
language are categorized according to their
morphological and syntactic features.
The common tag categories are noun, verb, adverb,
adjective, conjunction etc. POS tagging plays an important
role in applications such as machine translation, language
modeling, word sense disambiguation, question and
answer analysis, dialogue tagging, social media data
tagging and information retrieval. For example, the
Malayalam word ഇയ can be either a verb or a
noun, as it has two meanings - ഇയඐക, കരക
and ഇയഽ, ഒയഴാം. Therefore, the task of
the POS tagger is to disambiguate such words and
correctly identify the grammatical category.
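The ambiguity described above can be made concrete with a small sketch. The Scala snippet below is illustrative only: it collects the set of tags that each word form receives in a toy tagged corpus and reports the forms seen with more than one tag. The object name `TagAmbiguity`, the placeholder word forms `w1` to `w3` and the tag labels `NN` and `VB` are assumptions made for illustration; they are not taken from the paper's corpus or tag set.

```scala
object TagAmbiguity {
  // A tagged sentence is a list of (word, tag) pairs.
  type TaggedSentence = List[(String, String)]

  // Collect, for every word form, the set of tags it appears with.
  def tagsPerWord(corpus: List[TaggedSentence]): Map[String, Set[String]] =
    corpus.flatten
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => word -> pairs.map(_._2).toSet }

  // Word forms seen with more than one tag need context to be disambiguated.
  def ambiguousWords(corpus: List[TaggedSentence]): Set[String] =
    tagsPerWord(corpus).collect { case (word, tags) if tags.size > 1 => word }.toSet

  // Toy data with placeholder tokens and tags (hypothetical, not from the paper).
  val toyCorpus: List[TaggedSentence] = List(
    List("w1" -> "NN", "w2" -> "VB"),
    List("w3" -> "NN", "w1" -> "VB")
  )
}
```

Here `TagAmbiguity.ambiguousWords(TagAmbiguity.toyCorpus)` contains only `w1`, the one form that occurs with two different tags; such forms are where a tagger has to rely on the surrounding words.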
In the Indian language scenario, POS taggers
have been developed for the Dravidian languages (Kannada,
Malayalam, Tamil and Telugu), Hindi, Punjabi, Odia,
Marathi and Bengali. Each language has its own tag set,
prepared by different organizations or research groups,
containing main tags and sub-tags that refer to its
morpho-syntactic features [2]-[15]. The Bureau of Indian
Standards (BIS) POS tag set aims to
provide a common tag set for Indian languages. It
was prepared by the POS Tag Standardization Committee,
Department of Information Technology, New Delhi.
Several methods have been applied to the POS tagging
task: hidden Markov model based tagging [16], [17], [18],
memory-based learning [19],
maximum entropy modeling [20], transformation-based
learning [21], decision trees [22], [23], support vector
machines [24], [13], rule-based approaches [25] and
disambiguation rules [26], [27]. Hybrid approaches
combining stochastic methods and rules have also been
proposed [28]. Indian languages are morphologically rich,
which poses a major challenge in disambiguating words;
consequently, more tags are required to deal with the
ambiguities. The morphological richness of a language
also makes it difficult to prepare complex rules for POS
tagging. Machine learning approaches use the
linguistically motivated data associated with each
language. Owing to the highly inflective nature of
Indian languages, the methods used for one language
may not be useful for another.
Several articles on POS tagging of morphologically rich
languages have proposed combining stochastic methods
with specific rules hand-crafted with the help of
linguists [29], [30], [31], [32]. This approach requires
an expert linguist's opinion to create
accurate rules and a large corpus for the stochastic
methods to be effective. Several approaches to POS tagging
in Malayalam have also been reported [13], [45]. This
paper presents a POS tagger for Malayalam
using the EPIC framework in the Scala language. Here, the
POS tagging task is defined as a sequence labeling problem.
This is a first attempt to explore the EPIC framework for
POS tagging in Indian languages.
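To illustrate what treating POS tagging as a sequence labeling problem means, the following minimal Scala sketch represents a sentence and its labeling as equal-length sequences and computes token-level accuracy in percent. It is a sketch under stated assumptions, not the EPIC API or the evaluation code used in this study: the object name `SequenceLabeling`, the type aliases and the tags `NN` and `VB` are hypothetical, and the accuracy definition (correctly tagged tokens divided by all tokens) is the usual one, assumed here.

```scala
object SequenceLabeling {
  // A sentence is a sequence of tokens; a labeling assigns exactly one tag per token.
  type Sentence = Vector[String]
  type Labeling = Vector[String]

  // Token-level accuracy in percent over (gold, predicted) labeling pairs.
  def accuracy(pairs: Seq[(Labeling, Labeling)]): Double = {
    require(pairs.forall { case (gold, pred) => gold.length == pred.length },
      "gold and predicted labelings must have the same length")
    val total = pairs.map(_._1.length).sum
    val correct = pairs.map { case (gold, pred) =>
      gold.zip(pred).count { case (g, p) => g == p }
    }.sum
    if (total == 0) 0.0 else 100.0 * correct / total
  }
}
```

For example, with the gold labeling `Vector("NN", "VB", "NN", "NN")` and the predicted labeling `Vector("NN", "VB", "VB", "NN")`, three of four tokens agree, so `SequenceLabeling.accuracy` returns 75.0.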
This paper is organized as follows. Section
'Tag set' gives an overview of the AMRITA tag set. Section
'Conditional Random Fields' gives a brief introduction to
conditional random fields. Section 'EPIC framework' gives
an overview of the EPIC framework. In section
'Experimental Result', the experiments and the obtained
results are discussed.
1.1 Tag set
A tag set represents the tag categories that can be
used to tag each word based on its context. Researchers
working on Indian languages use different tag sets, such
as AUKBC, Vasuranganathan tag set, CIIL tag set,