Neural Networks, Part-of-Speech Tagging and Lexicons *

Nuno C. Marques ** (nmm@di.fct.unl.pt)
Gabriel Pereira Lopes (gpl@di.fct.unl.pt)

Technical Report DI-FCT/UNL n. 6/98
Universidade Nova de Lisboa - Faculdade de Ciências e Tecnologia
Departamento de Informática
Grupo de Língua Natural
2825 Monte de Caparica
Portugal
http://www-ia.di.fct.unl.pt/~nmm/artigos.html

Abstract

Neural networks are one of the most efficient techniques for learning from scarce data. This property is very useful when building a part-of-speech tagger. Available part-of-speech taggers need huge amounts of hand-tagged text, but for Portuguese, as for many other languages, no such hand-tagged corpora are available. In this paper we propose the cooperation of a lexical system and a neural network in such a way that the need for a huge training corpus is overcome. The network topology we used was applied to the problem of learning the parameters of a part-of-speech tagger from a very small Portuguese training corpus and from a subset of the Susanne Corpus. The experiments carried out are discussed. The results obtained point to a correct tagging rate above 97% when we start from a hand-tagged training corpus of approximately 15,000 words. The application of our system to real texts is also described.

1. Introduction

The application potential of textual corpora increases when the corpora are annotated. The first logical level of annotation is usually part-of-speech tagging. At this higher level, the text is no longer seen as a mere sequence of strings but as a sequence of linguistic entities with some natural meaning. The annotated text can then be used to introduce further types of annotation (usually by means of syntactic parsing [Marcus et al., 93] or [Marcken, 90]), or may be used directly (or indirectly) to collect statistics for different kinds of applications.
Working at the word-tagging level has enabled applications such as speech synthesis [Church et al., 93], clustering [Pereira et al., 93], computational lexicography [Manning, 93], and even improved spell checking. The success of this kind of technique is certainly due to its intrinsic capability for assigning a sequence of part-of-speech tags to any sequence of words with high precision, using quite modest computing resources. Despite this, part-of-speech taggers are not yet as widely available as they should be, especially for languages other than English. The main problem with currently available part-of-speech taggers is the lack of tagged corpora: almost every tagger needs huge amounts of hand-tagged text.

* Work partially supported by the projects Corpus (funded by JNICT under contract number PLUS/C/LIN/805/93) and ESA: tagging and segmentation of medieval Portuguese Corpora (funded by JNICT under contract number FCSH/C/LIN/931/95).
** Work supported by PhD scholarship JNICT-PRAXIS XXI/BD/2909/94.