Wordnet.Br: An Exercise of Human Language Technology Research

Bento Carlos Dias-da-Silva
CELiC - Faculdade de Ciências e Letras, Universidade Estadual Paulista
Rodovia Araraquara-Jau Km 1, 14800-901 Araraquara, São Paulo, Brazil
bento@fclar.unesp.br

Abstract

This paper reports on the ongoing project (since 2002) of developing a wordnet for Brazilian Portuguese (Wordnet.Br) from scratch. In particular, it describes the process of constructing the Wordnet.Br core database, which has 44,000 words organized in 18,500 synsets. Accordingly, it briefly sketches the project's overall methodology, its lexical resources, the synset compilation process, and the Wordnet.Br editor, a GUI (graphical user interface) that aids the linguist in the compilation and maintenance of Wordnet.Br. It concludes with the planned further work.

Introduction

Assuming a compromise between Human Language Technology and Linguistics, and based on the Artificial Intelligence notion of Knowledge Representation Systems (Hayes-Roth, 1990; Durkin, 1994), this project applies a three-domain approach methodology to the development of the Brazilian Portuguese (BP) WordNet (Wordnet.Br).¹ This approach claims that the linguistic-related information to be computationally modelled, like a rare metal, must be "mined", "molded", and "assembled" into a computer-tractable system (Dias-da-Silva, 1998).
Accordingly, the processes of designing and implementing the Wordnet.Br lexical database are being developed in the following complementary domains:

• The Linguistic-related Domain, where the lexical resources (dictionaries and text corpora), the lexical-conceptual relations (synonymy, antonymy, hyponymy, meronymy, entailment, cause), and a sort of natural language ontology of concepts ("Base Concepts" and "Top Ontology")² are mined;
• The Representational Domain, where the overall information selected and organized in the preceding domain is molded into a computer-tractable representation (the "synsets", the "lexical matrix", and the wordnet "lexical database" itself)³;
• The Computational Domain, where the computer-tractable representations are assembled by means of utilities (the Wordnet.Br editor).

This paper, in particular, reports the first part of the project, in which, over a two-year span, three linguists and a computer scientist, each working in their respective domain, compiled the Wordnet.Br core database: 44,000 BP words organized in 18,500 synsets. In other words, the core database is a thesaurus-like lexical database.

¹ This project was supported in part by contract 552057/01, with funding provided by the National Council for Scientific and Technological Development (CNPq), and in part by grant 2003/03623-7 from the State of São Paulo Research Foundation (FAPESP).
² Rodríguez et al. (1998).
³ Fellbaum (1998).

1 The Linguistic-related Domain

1.1 Synonymy in Context

The Wordnet.Br core database architecture conforms to the two key representations of the Princeton WordNet (Fellbaum, 1998): the synset and the lexical matrix. Its synsets are built on the basis of the notion of "synonymy in context", i.e. word interchangeability in context (Miller, 1998). Antonymy is checked against both the morphological properties of words and their dictionary lexicographical information.
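The two representations named above can be pictured with a small data structure. The following is only an illustrative sketch in Python, not the Wordnet.Br implementation: the class names `Synset` and `LexicalMatrix`, the method names, and the sample entries for the verb lembrar are all assumptions made for the example and are not taken from the project's database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A set of word forms treated as interchangeable in some context."""
    synset_id: int
    pos: str            # part of speech, e.g. "v" for verb
    words: frozenset    # the synonymous word forms


class LexicalMatrix:
    """Many-to-many map between word forms and synsets (hypothetical design)."""

    def __init__(self):
        self._synsets = {}                 # synset_id -> Synset
        self._by_form = defaultdict(set)   # word form -> set of synset ids

    def add_synset(self, synset_id, pos, words):
        synset = Synset(synset_id, pos, frozenset(words))
        self._synsets[synset_id] = synset
        for form in synset.words:
            self._by_form[form].add(synset_id)
        return synset

    def senses(self, form):
        """All synsets a form belongs to (one form, possibly many meanings)."""
        return [self._synsets[i] for i in sorted(self._by_form.get(form, ()))]

    def synonyms(self, form):
        """All other forms that share at least one synset with `form`."""
        found = set()
        for synset in self.senses(form):
            found |= synset.words
        found.discard(form)
        return found


# Hypothetical sample entries, for illustration only.
matrix = LexicalMatrix()
matrix.add_synset(1, "v", ["lembrar", "recordar", "relembrar"])
matrix.add_synset(2, "v", ["lembrar", "parecer"])
```

In this sketch, one form appearing in several synsets models polysemy, and several forms sharing one synset model synonymy, so `matrix.senses("lembrar")` returns two synsets while `matrix.synonyms("recordar")` returns the other members of the first one.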
The notion of lexical matrix (Miller and Fellbaum, 1991) is intended to capture the "many to many" associations between form and meaning.

1.2 The Reference Corpus

Given a team of three linguists, the impossibility of reusing machine-readable dictionaries and other existing wordnets,⁴ and a two-year deadline to present large-scale results, the Wordnet.Br developers manually reused, merged, and tuned synonymy and antonymy information registered in five outstanding published dictionaries of BP:⁵ Ferreira (1999), Weiszflog (1998), Barbosa (1999), Nascentes (1981), and Borba (1990). BP texts available in the NILC Corpus (CETENFolha, 2004) and on the web complemented the project reference corpus.

To understand how the linguists "mined" the reference corpus for synsets, let us follow an example. Weiszflog (1998) distinguishes seven senses of the verb lembrar (English: "to remember"). After collecting the synonyms, and

⁴ Copyright reasons prevented us from reusing or adopting existing wordnet databases and utilities.
⁵ The dictionaries were chosen for their pervasive use of synonymy and antonymy to define word senses. In a way, this choice dictated that the work proceed alphabetically, instead of by semantic fields.

Petr Sojka, Key-Sun Choi, Christiane Fellbaum, Piek Vossen (Eds.): GWC 2006, Proceedings, pp. 301–303. © Masaryk University, 2005