Portuguese Large-scale Language Resources for NLP Applications

Elisabete Ranchhod¹, Paula Carvalho¹, Cristina Mota², Anabela Barreiro¹
¹ Universidade de Lisboa and LabEL (CAUTL/IST), ² LabEL (CAUTL/IST)
Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
{elisabet, paula, cristina, anabela}@label.ist.utl.pt

Abstract

The paper describes large-scale linguistic resources for Portuguese, mainly computational lexicons and grammars, developed by LabEL. These resources are formalized and applied to texts by means of finite-state techniques, which are increasingly adopted in Natural Language Processing. On the one hand, the paper illustrates methods of lexical representation for simple words and multi-word expressions; on the other hand, it provides examples (in the form of concordances) of linguistic structures recognized after disambiguation and parsing grammars are applied to texts. The paper ends with a short reference to the publicly available data, highlighting its contribution to the dissemination of LabEL's knowledge of language technology.

1. Introduction

The increasing interest in NLP has highlighted the growing need for linguistically precise, broad-coverage language resources. In this context, LabEL has been developing large-scale lexicons and grammars for Portuguese. These language data are formalized and applied to texts using finite-state techniques. Automata and finite-state transducers (FSTs) are particularly suitable for the easy and compact representation of different types of linguistic data. They reduce space and time overhead in text processing operations.

Finite-state technology is used to:

(i) Build electronic dictionaries, by formalizing and generating precise linguistic information related to simple and multi-word lexical units: e.g. linguística, linguistics; linguística computacional, computational linguistics.

(ii) Develop grammars for lexical disambiguation (POS, lemma, inflectional information, etc.): e.g.
ama, nurse (N) and loves (V); forte, fort (N) and strong (A); cobre, copper (N), covers (V) and collects (V).

(iii) Develop local grammars for the identification and tagging of linguistic expressions with strong lexical-syntactic constraints (such as adverbials of time and space, units of measure, etc.): e.g. há cerca de dez anos, about ten years ago; estão 37,5ºC à sombra, it's 37.5ºC in the shade.

(iv) Develop grammars to parse different syntactic constructions, such as noun phrases and complex predicates.

The Portuguese resources are integrated in two public FST-based corpus processors, INTEX and UNITEX.

2. Electronic Dictionaries

The lexicon is the foundation of any sound NLP application. LabEL's dictionary system consists of a set of modules, each organized according to the formal complexity of the lexical units it represents.

2.1. Simple Word Dictionaries

The core module of the dictionary system contains about 120,000 simple words (lemmas), each with its own systematically encoded morphological attributes. Codes specify the POS of the entry and its inflectional information (gender, number, person, case, tense, mood, diminutives, augmentatives, and superlatives), which varies according to the POS involved. Sample 1 illustrates a few entries of the simple word dictionary.

simples,A116+Det                 [simple]
simples,A116.ss024.sr001+Pd      [simple]
sobreviver,V102x                 [survive]
sol,N213.dh247.dt247             [sun]

Sample 1: Simple word lemmas

Syntactic and semantic information is being encoded incrementally. For instance, adjectives are being refined with information about their syntactic sub-classification. Such refinement allows homograph adjectival entries to be separated on a formal basis. For instance, simples, represented above, is described in two different entries, which correspond to the predicative (Pd) and determinative (Det) adjective analyses, as in examples (1) and (2) below.
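The entry layout shown in Sample 1 lends itself to mechanical reading. The following is a minimal sketch under stated assumptions, not LabEL's own tooling: it assumes that the text before the comma is the lemma, that the dot-separated codes before any `+` name the inflectional FSTs, that the leading capital letters of the first code give the POS, and that `+`-prefixed tags are syntactic features. The names `LemmaEntry` and `parse_lemma_entry` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class LemmaEntry:
    lemma: str            # citation form, e.g. "simples"
    pos: str              # assumed: leading capitals of the first code, e.g. "A"
    fst_codes: List[str]  # dot-separated codes, e.g. ["A116", "ss024", "sr001"]
    features: List[str]   # "+"-prefixed tags, e.g. ["Pd"]


def parse_lemma_entry(line: str) -> LemmaEntry:
    """Split one Sample 1 style line into its parts (assumed layout)."""
    lemma, sep, info = line.strip().partition(",")
    if not sep or not lemma or not info:
        raise ValueError(f"not a 'lemma,codes' entry: {line!r}")
    # Everything before the first "+" is the code string; the rest are features.
    codes_part, *features = info.split("+")
    fst_codes = codes_part.split(".")
    # Assumption: the POS is the run of capital letters opening the first code.
    match = re.match(r"[A-Z]+", fst_codes[0])
    pos = match.group(0) if match else ""
    return LemmaEntry(lemma, pos, fst_codes, features)


if __name__ == "__main__":
    sample_1 = [
        "simples,A116+Det",
        "simples,A116.ss024.sr001+Pd",
        "sobreviver,V102x",
        "sol,N213.dh247.dt247",
    ]
    for raw in sample_1:
        print(parse_lemma_entry(raw))
```

Applied to the two entries for simples, the sketch yields one record tagged Det and one tagged Pd, the two analyses that examples (1) and (2) illustrate: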
(1) This is a very simple question
(2) He did not find a simple reason to come

The inflected simple forms (about 1,250,000) are generated by the system from the inflectional FSTs referenced in the lemmas. For instance, FST A116.ss024.sr001 generates the inflected forms of the predicative adjective simples, as well as those of other adjectives with similar morphological behavior (invariable adjectives that can take the superlative morphemes -íssimo and -érrimo). FSTs also assign the linguistic information corresponding to each generated form, as in Sample 2.

simples,simples.A+Pd:ms:fs:mp:fp
simplérrima,simples.A+Pd:Sfs
simplérrimas,simples.A+Pd:Sfp
simplérrimo,simples.A+Pd:Sms
simplérrimos,simples.A+Pd:Smp
simplicíssima,simples.A+Pd:Sfs
simplicíssimas,simples.A+Pd:Sfp
simplicíssimo,simples.A+Pd:Sms
simplicíssimos,simples.A+Pd:Smp

Sample 2: Inflection of simples

2.2. Multi-Word Dictionaries

It is impossible to envisage automatic text analysis without adequate identification and treatment of multi-word lexical