Bhasa Bijnan o Prayukti: An International Journal on Linguistics and Language Technology, Vol. 1, No. 1, Jan-Jun 2013, Pp. 53-96

Part-of-speech (POS) Tagging of Bengali Written Text Corpus

Niladri Sekhar Dash
Linguistic Research Unit, Indian Statistical Institute, Kolkata, India
Email: ns_dash@yahoo.com

Abstract

In Natural Language Processing, Language Technology, and Computational Linguistics, part-of-speech (POS) tagging is the task of assigning words to particular parts of speech. It is also known as Grammatical Annotation (GA) and Word Category Disambiguation (WCD). Its primary task is to mark each word in a piece of natural text, manually or automatically, with a tag identifying its part-of-speech, based on the word's form and its function in the larger syntactic frames in which it is used, such as phrases, sentences, paragraphs, and texts. An electronically developed language database, after being tagged at the part-of-speech level, becomes a valuable resource for various works of natural language processing, language technology, computational linguistics, machine learning, cognitive linguistics, applied linguistics, and descriptive linguistics. POS tagging is generally carried out on electronic versions of language corpora (either manually or automatically) using a set of predefined tags, which are primarily used to mark the part-of-speech, linguistic properties, and functions of words, terms, and other lexical items used in corpora. Although the assignment of part-of-speech to words appears to be simple, straightforward, and one-dimensional, it is in fact embedded with several theoretical and technical complexities regarding the identification of the actual lexico-semantic entities and syntactico-grammatical functions of words used in a piece of text.
It also involves defining the basic hierarchical modalities of tag assignment and designing a rule-based schema for the automatic assignment of tags to words. All these issues call for a highly synchronized strategy that combines linguistic and extralinguistic knowledge with computational expertise to achieve maximum success with minimum effort. Keeping these issues in view, in this paper we attempt to define the maxims, principles, and rules to be followed in designing a tagset for tagging words at the part-of-speech level in a Bengali corpus. We also define the strategies to adopt when tagging words manually and when formulating algorithms for the automatic assignment of POS tags to words in the Bengali corpus, and we address some other issues of tag assignment with reference to examples drawn from the Bengali corpus. Although we address some important theoretical issues within the general scope of the field and propose some rudimentary rules to be followed at the time of POS tagging, we have deliberately avoided all technical issues and computational aspects of POS tagging, in the expectation of keeping the topic open as an area of general interest and enquiry. However, we have addressed in detail almost all the major linguistic and extralinguistic issues relating to manual POS tagging, with reference to instances taken from the Bengali written text corpus.
We hope that this paper will be treated as a ‘genesis paper’ whose principal goals are to clarify the very concept of POS tagging; identify the maxims and principles of POS tagging; highlight the pros and cons of the tagset proposed by the Bureau of Indian Standards (BIS); define the strategies and rules for POS tagging of the Bengali corpus; and alert the new generation of scholars who are going to be engaged in POS tagging of the Bengali corpus to the invisible quicksand on the path of their journey.

Key Words: annotation, tagging, noun, verb, adjective, adverb, postposition, Bengali, part-of-speech, morphology, syntax, semantics, context

1. Introduction

In the areas of Natural Language Processing (NLP), Language Technology (LT), and Computational Linguistics (CL), the process of Part-of-Speech (POS) tagging is also identified as Grammatical Annotation (GA), the primary goal of which is to disambiguate words at the