International Journal of Computer Applications (0975 – 8887)
Volume 99 – No. 19, August 2014

Parts of Speech Tagging in Bengali for MWEs Detection

Md Jaynal Abedin
Department of Computer Science, Assam University, Silchar, Assam, India

Bipul Syam Purkayastha
Department of Computer Science, Assam University, Silchar, Assam, India

ABSTRACT
Part of speech (POS) tagging is the process of assigning a part of speech tag to every word in a sentence. In many Natural Language Processing applications, such as word sense disambiguation, information retrieval, information processing, parsing, question answering, MWEs detection and machine translation, POS tagging is considered one of the basic and important tools. Resolving the ambiguities of lexical items in a language depends on proper identification of their parts of speech, which can enhance language processing applications in different ways. This paper describes a POS tagset for Multiword Expressions (MWEs) detection in Bengali (Bangla), a task that is also very important for many natural language processing (NLP) applications.

Keywords
MWEs, annotation, tagging, noun, verb, adjective, adverb, postposition, part-of-speech

1. INTRODUCTION
Bengali (Bangla) is morphologically rich and has a highly inflectional system. Inflections on nouns include postpositions and number, gender and case markers; inflections on verbs include person, tense, aspect, honorific, non-honorific, pejorative, finiteness and non-finiteness. Because syntactic bracketing is a shallow processing task and the size of the tagset is an important factor, only postpositions and the accusative and possessive case markers on nouns have been incorporated in this tagset. To reflect only these morphological characteristics, a separate category ‘Suffixes’ has been included to denote the inflections.
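How a noun might be divided into a base form and a ‘Suffixes’ element can be sketched as follows. This is only an illustration under stated assumptions, not the tagger described in this paper: the romanized suffix list, the labels `ACC` and `POSS`, the small lexicon and the function name `split_suffix` are all hypothetical.

```python
# Hypothetical, romanized suffix inventory; the paper's actual tagset
# and labels are not reproduced here.
CASE_SUFFIXES = {
    "ke": "ACC",   # accusative marker, e.g. chele + ke
    "er": "POSS",  # possessive marker after a consonant, e.g. ram + er
    "r": "POSS",   # possessive marker after a vowel, e.g. chele + r
}

# Tiny illustrative lexicon of base forms (an assumption, not from the paper).
LEXICON = {"chele", "ram", "boi"}


def split_suffix(token):
    """Return (base, suffix, label), or (token, None, None) if no split applies.

    A split is accepted only when the remaining base is a known lexicon
    entry, which guards against stripping letters that belong to the stem.
    """
    # Longer suffixes are tried before shorter ones.
    for suffix in sorted(CASE_SUFFIXES, key=len, reverse=True):
        base = token[: -len(suffix)]
        if token.endswith(suffix) and base in LEXICON:
            return base, suffix, CASE_SUFFIXES[suffix]
    return token, None, None


if __name__ == "__main__":
    for word in ["cheleke", "ramer", "cheler", "boi"]:
        print(word, split_suffix(word))
```

The lexicon check is a deliberate design choice: blind suffix stripping would wrongly split any stem that happens to end in one of these letter sequences, so a real system would need a morphological analyzer in its place.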
When a noun or a pronoun is inflected by a suffix, the base form and the inflection are separated by a plus sign (+) [1]. Verbs are categorized according to their form, such as finite and non-finite.

Multiword Expressions (MWEs) play an important role in Natural Language Processing (NLP), because NLP deals with text whose words may interact with one another. MWEs have attracted an increasing amount of interest in the fields of computational linguistics and NLP [2]. A formal definition is given in [3]: Multiword expressions (MWEs) are lexical items that (a) can be decomposed into multiple lexemes, and (b) display lexical, syntactic, semantic, pragmatic or statistical idiomaticity. MWEs are characterized by non-compositionality, non-substitutability and non-modifiability [4].

We are developing an annotated corpus for MWEs detection in order to improve the efficiency of that detection. POS tagging thus helps in annotating Bangla text to form a syntactic treebank. In tagging, the pure lexical category of a word has so far been preferred [5][6], because it ensures consistency in tagging and reduces the confusion involved in manual tagging. It also helps a machine to establish a word-tag relation, which leads to efficient machine learning.

2. LITERATURE SURVEY FOR INDIAN LANGUAGES
Compared to Indian languages, languages such as English, Arabic and the European languages have many POS taggers [7]. The literature shows that, among Indian languages, POS taggers have been developed only for Hindi, Bengali, Punjabi and the Dravidian languages. Although large annotated corpora for Bengali (Bangla) are growing slowly in comparison with developments in the wider field of NLP, some recent work experimenting with stochastic models [8][9][10] has achieved higher accuracy in automatic POS tagging.
It has been shown that the accuracy of a POS tagger can be significantly improved by integrating a morphological analyzer, prefix/suffix information, a named entity recognizer, etc.

3. MOTIVATION FOR THE IDENTIFICATION OF MWEs IN BENGALI
Many difficulties arise in Bengali POS tagging, and these motivate us to work on MWEs detection in Bengali. Some examples of MWEs that are difficult for POS tagging are kany laga, which means ‘interesting’; kan kata, which means ‘shameless’; hat taka, which means ‘right’; utanto mulo potony chena jaey, which means ‘morning shows the day’; and so on. Good morphological analyzers, POS taggers, stemmers, annotated corpora, etc. are not yet available for this task. Bengali is a highly versatile language with one of the most challenging sets of linguistic and statistical features, which results in complex and long word formations. Besides other NLP tasks such as information retrieval, text summarization and machine translation, Bengali also needs MWEs to be identified, together with their detection and extraction from different domains.

4. STEPS TO POS TAGGING
The first step towards POS tagging is morphological analysis of the words. For this, a noun analysis and a verb analysis of the words have been carried out. Nouns are divided into three paradigms according to their endings; these three paradigms are further classified into two groups depending on the feature ± animate. The suffixes are then classified based on number, postposition and classifier information. Verbs are classified into 6 paradigms based on morphosyntactic alternation of the root. The suffixes