Part of Speech Tagging with Naïve Bayes Methods

R. Crețulescu, A. David, D. Morariu, L. Vințan
Computer Science and Electrical Engineering Department
“Lucian Blaga” University of Sibiu
Sibiu, Romania
{radu.kretzulescu, alexandru.david, daniel.morariu, lucian.vintan}@ulbsibiu.ro

Abstract— In this paper we focus on the problem of automatically predicting the parts of speech of the words in a sentence. We present an experimental framework which includes the analysis and implementation of methods for part of speech (POS) labeling (tagging). We have tested three methods that predict the POS without the current word's context and three context-aware statistical methods. The main goal of our work was to evaluate the three statistical methods (the Forward, Backward and Complete methods) in order to analyze their applicability to the problem of automatic POS prediction. These methods are derived from the classic Naïve Bayes classifier. In our research we used the WordNet database and a benchmark set called the Brown University Standard Corpus of Present-Day American English. The non-context-aware methods obtain better results than the statistical methods, but they are not as reliable.

Keywords— NLP, Naïve Bayes, Part Of Speech Prediction

I. INTRODUCTION

In the field of Word Sense Disambiguation (WSD) [2], a range of linguistic phenomena, such as preferential selection or domain information, have been identified as relevant for resolving the ambiguity of words. These properties are called linguistic knowledge sources. Current WSD system reports do not mention these sources explicitly, but rather present low-level features such as "bag-of-words" or "n-gram" representations [3] used in the disambiguation algorithms; one of the reasons is that the features (the coding) incorporate more than one source of knowledge.
A lot of research in Natural Language Processing (NLP) [5] focuses on intermediate tasks that use known structures inherent in the language. One such task is part of speech labeling (or tagging). This process involves assigning a label to each word in a sentence; the label represents the part of speech (POS) of that word.

In this paper we focus on the problem of automatically detecting the parts of speech within an English text in order to discover some semantic features of the sentence. For this, we used the WordNet database [10], which is often used in WSD algorithms, and a benchmark set called the Brown University Standard Corpus of Present-Day American English (or Brown Corpus) [1]. We present an experimental framework which includes the analysis and implementation of algorithms for POS labeling.

POS prediction is a difficult task even for the English language, because a significant share of English words (approximately 33%) can take multiple syntactic forms (multiple parts of speech) when no context is given. Unfortunately, these words are among the most commonly used in colloquial language. For example, tests on the Brown Corpus benchmarks [1] found that 40% of the words [9] have more than one POS. At the sentence/phrase level we found that, on average, 60% of the words in a sentence have multiple possible parts of speech. This percentage is high enough to change the semantics of a sentence if the parts of speech are misidentified. Resolving this ambiguity is important because it forms the basis for natural language processing applications such as machine translation and word sense disambiguation [7]. To date, there is no automated solution that labels parts of speech with perfect accuracy [8].

II. THE EXPERIMENTAL FRAMEWORK

A. The used datasets

We have developed our own framework, which uses as input the Standard Corpus of Present-Day American English (Brown Corpus [1]).
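To illustrate the context-free baseline that the statistical methods are compared against, the sketch below estimates the most likely tag for a word purely from per-word tag counts in a tagged training set, falling back to the globally most frequent tag for unseen words. This is only a minimal illustration under our own assumptions, not the authors' implementation; the toy corpus and its coarse five-tag labels are invented for the example.

```python
from collections import Counter, defaultdict

def train(tagged_words):
    """Count tag occurrences per word and overall tag frequencies."""
    word_tags = defaultdict(Counter)
    tag_counts = Counter()
    for word, tag in tagged_words:
        word_tags[word.lower()][tag] += 1
        tag_counts[tag] += 1
    return word_tags, tag_counts

def predict(word, word_tags, tag_counts):
    """Context-free prediction: the most frequent tag observed for
    this word; unseen words get the globally most frequent tag."""
    counts = word_tags.get(word.lower())
    if counts:
        return counts.most_common(1)[0][0]
    return tag_counts.most_common(1)[0][0]

# Toy tagged corpus (hypothetical data, coarse 5-tag set as in the paper).
corpus = [("the", "other"), ("dog", "noun"), ("runs", "verb"),
          ("fast", "adverb"), ("a", "other"), ("fast", "adjective"),
          ("dog", "noun"), ("barks", "verb"), ("fast", "adverb")]

word_tags, tag_counts = train(corpus)
print(predict("fast", word_tags, tag_counts))  # → adverb (2 of 3 occurrences)
print(predict("cat", word_tags, tag_counts))   # unseen word: global fallback
```

Note that "fast" is itself ambiguous (adverb vs. adjective), which is exactly the kind of word the context-aware statistical methods are meant to handle better than this frequency-only baseline.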
The Brown Corpus is a general collection of texts that is used in natural language processing research and was manually created by professors Henry Kucera and W. Nelson Francis. We divided the data provided by the Brown Corpus into two sets: a set containing 70% of the texts, chosen randomly, used for training, and the other set, containing the remaining 30% of the texts, used for testing.

To simplify the usage and interpretation of the tags attached to the words extracted from the Brown Corpus [1], we decided to reduce the number of tags from 82 to 5. Our selected tags are: noun, verb, adverb, adjective and other (for any other POS). We chose these tags because we also used the WordNet database [10], which offers support only for the first four of them.

B. The application architecture

Our software architecture is specialized for the problem of grammatical analysis. It is designed as a modular architecture which allows easy integration of disambiguation algorithms. The architecture provides a number of facilities for them, such as extraction and pre-processing modules, a module for identifying the tags of words using WordNet, and a module for evaluating the tagging accuracy. The POS prediction is achieved using either WordNet alone or a combination of WordNet and the Brown Corpus. Also we
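The two pre-processing steps described above (reducing the 82 Brown tags to 5 coarse tags and the random 70/30 train/test split) can be sketched as follows. The paper does not specify the exact tag mapping, so the prefix table below is an assumption: NN, VB, JJ and RB are the standard Brown tag families for nouns, verbs, adjectives and adverbs, and everything else is mapped to "other". The seed value is likewise arbitrary.

```python
import random

# Assumed mapping from Brown tag prefixes to the paper's 5 coarse tags.
# The authors' exact mapping is not given; NN/VB/JJ/RB are the standard
# Brown noun/verb/adjective/adverb families.
PREFIXES = [("NN", "noun"), ("VB", "verb"),
            ("JJ", "adjective"), ("RB", "adverb")]

def reduce_tag(brown_tag):
    """Map a fine-grained Brown tag to one of the 5 coarse tags."""
    tag = brown_tag.upper()
    for prefix, coarse in PREFIXES:
        if tag.startswith(prefix):
            return coarse
    return "other"

def split_texts(texts, train_fraction=0.7, seed=42):
    """Randomly split the texts into a training and a testing set."""
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    cut = int(len(texts) * train_fraction)
    return texts[:cut], texts[cut:]

print(reduce_tag("NNS"))   # plural noun → noun
print(reduce_tag("VBD"))   # past-tense verb → verb
print(reduce_tag("AT"))    # article → other
train_set, test_set = split_texts([f"text{i}" for i in range(10)])
print(len(train_set), len(test_set))  # → 7 3
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing the six prediction methods on identical train/test partitions.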