Statistical Part-of-Speech Guessing for German: Support Vector Classifiers versus Voting

David Reitter
University of Potsdam
Dept. of Linguistics / Applied Computational Linguistics
P.O. Box 601553 / D-14415 Potsdam / Germany
reitter@ling.uni-potsdam.de

Abstract

In this paper, I present a statistics-based approach to the part-of-speech guessing problem. I treat the assignment of a part of speech, such as adjective or noun, as a classification problem. My guessing framework, which relies on automated learning of a language model, is described in detail. The rich feature analysis presented is suitable for linguistic data such as that observed in German. I use a large-margin classifier learning algorithm to select relevant features and learn an appropriate labelling. The system is evaluated on a German corpus.

1 Introduction

Part-of-speech guessing algorithms are designed to assign one or more possible part-of-speech categories, such as finite verb, adjective or common noun, to a given word not yet contained in a given lexicon.

Such a task must be accomplished during part-of-speech tagging, a process that assigns unambiguous part-of-speech tags to the words of a given sentence. Because a word usually has a variety of part-of-speech categories, tagging means much more than merely looking up the words in a lexicon. The actual category depends on the syntactic (and sometimes semantic) context; tagging thus performs morpho-syntactic disambiguation. Tagging algorithms usually access a lexicon to first assign ambiguity classes (sets of several candidate categories) to each word, thus reducing the search space. Then an optimal assignment is found with the help of a previously learned statistical language model, or of defined or learned rules.

A good lexicon is vital to the success of the task. Usually, full-form lexicons are used; they can be acquired from a corpus that is initially tagged with an existing P.O.S. tagger and then validated manually.
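The lexicon lookup step described above can be sketched as follows. This is an illustrative fragment, not the paper's implementation: the tiny lexicon and the tag labels (loosely STTS-style, as used for German corpora) are invented for the example.

```python
# Hypothetical toy lexicon: each word form maps to its ambiguity class,
# i.e. the set of part-of-speech categories it can bear.
LEXICON = {
    "die":   {"ART", "PRELS"},   # article or relative pronoun
    "Frau":  {"NN"},             # common noun
    "lacht": {"VVFIN"},          # finite verb
}

def ambiguity_classes(sentence, lexicon):
    """Return the ambiguity class for each token.

    None marks an unknown word, which would be handed to a
    part-of-speech guesser instead of the lexicon.
    """
    return [lexicon.get(token) for token in sentence]

classes = ambiguity_classes(["die", "Frau", "lacht", "Xylograph"], LEXICON)
```

A later disambiguation step (statistical model or rules) would then pick one category per token from these sets.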
However, the lexicon will never contain all words found in a text, and increasing the lexicon size will not improve the situation much: following Zipf's law [1] (Brill, 1995), a larger lexicon will not yield a notable gain in coverage on unrestricted text. There are relatively few high-frequency words and increasingly many words as we move to the low-frequency entries of the lexicon. In the Brown corpus, 44 percent of all words occur only once. [2]

[1] Zipf's law: when we order the lexicon by decreasing word frequency, the rank of an entry multiplied by its occurrence frequency is approximately constant.

Thus, part-of-speech guessing algorithms are needed in order to replace huge parts of the lexicon. These methods will also cover new word formations, which can never be included in a static lexicon.

In the following, I will first look at previous approaches and evaluate them on a theoretical basis with regard to a rich inflectional system such as that of German. This will lead us to the types of features we need to identify when guessing the part-of-speech category of a word. I will then apply the Support Vector Machine technique to this problem in order to implement a supervised learning mechanism. The method is evaluated on a lexicon derived from the German NEGRA corpus.

2 Common Guessing Approaches

All guessing approaches described in the following use a form of rules, which are triggered by word affixes. Morphological rules refer to an entry in a lexicon; they account for morphological alternation, such as inflection or derivation. Non-morphological rules do not refer to any lexicon entries and can thus also cope with newly invented words or spelling mistakes.

Eric Brill describes a transformation-based tagger and a transformation-based guessing mechanism. His tagger and guesser are error-driven, i.e. the rules employed in his algorithms revise, in multiple iterations, the previous assignment of arbitrarily chosen tags.
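The rank-frequency relation stated in the footnote can be checked numerically. The sketch below uses idealized, invented frequencies (not corpus counts from the paper): if frequency is proportional to 1/rank, then rank times frequency stays constant.

```python
# Hypothetical frequency of the most frequent word (rank 1).
C = 12000.0

# Idealized Zipfian frequencies for ranks 1..5: f(r) = C / r.
freqs = [C / r for r in range(1, 6)]

# Under Zipf's law, rank * frequency is the same constant C at every rank.
products = [r * f for r, f in enumerate(freqs, start=1)]
```

Real corpus counts only approximate this; the long tail of rare words it implies is exactly why enlarging the lexicon yields diminishing returns.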
For the guesser, this means that all unknown words may, e.g., first be tagged as 'common noun' (the most frequent category among unknown words). This category is then changed depending on certain preconditions stated in the rules. The rules are learned from a corpus; they are instantiations of the following templates:

- Deleting the prefix/suffix x (|x| < 4) results in a known word.
- The first/last n (1 < n < 4) characters of the word are x.
- Adding the character string x as a prefix/suffix results in a known word (|x| < 4).

[2] Cited after Daciuk et al. (1998).
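A minimal sketch of how instantiations of such templates could revise the initial tag, assuming a toy set of known word forms and invented example rules (this is not Brill's learned rule list, only the mechanism):

```python
# Toy lexicon of known word forms; illustrative only.
KNOWN_WORDS = {"schnell", "Haus"}

def guess_tag(word, known=KNOWN_WORDS):
    """Guess a part-of-speech tag for an unknown word, Brill-style."""
    # Initial assignment: 'common noun', the most frequent category
    # among unknown words.
    tag = "NN"
    # Instance of template 1: deleting a suffix x (|x| < 4) results
    # in a known word -> treat as an inflected adjective form.
    for k in range(1, 4):
        if len(word) > k and word[:-k] in known:
            tag = "ADJA"   # e.g. "schnelle" -> known "schnell"
    # Instance of template 2: the last 3 characters are "ung" ->
    # German -ung derivations are nouns.
    if word.endswith("ung"):
        tag = "NN"
    return tag
```

In Brill's error-driven setting the rules are applied in the order in which they were learned, each one revising the output of its predecessors; the fixed ordering above merely imitates that behaviour for two hand-picked rules.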