NomLex-PT: A Lexicon of Portuguese Nominalizations

Valeria de Paiva 1, Livy Real 2, Alexandre Rademaker 3, Gerard de Melo 4
1: Nuance Communications, Sunnyvale, CA, USA
2: Universidade Federal do Paraná, Curitiba, Brazil
3: IBM Research and FGV/EMAp, Rio de Janeiro, Brazil
4: Tsinghua University, Beijing, China
valeria.depaiva@gmail.com, livyreal@gmail.com, alexrad@br.ibm.com, gdm@demelo.org

Abstract

This paper presents NomLex-PT, a lexical resource describing Portuguese nominalizations. NomLex-PT connects verbs to their nominalizations, thereby enabling NLP systems to observe the potential semantic relationships between the two words when analysing a text. NomLex-PT is freely available and encoded in RDF for easy integration with other resources. Most notably, we have integrated NomLex-PT with OpenWordNet-PT, an open Portuguese Wordnet.

Keywords: NomLex, Portuguese, nominalizations

1. Introduction

Human language is marvellously flexible in providing numerous alternative ways to express an idea. Often, these alternatives transcend the more conventional associations between categories of form and meaning. While events (and some states) are typically expressed by verbs, there are also many nouns that can be used to refer to them. Unfortunately, this flexibility also leads to significant challenges for computational systems, which typically lack information about such connections between nouns and verbs. Thus, a system encountering the noun proof might have difficulty in recognizing its semantic relationship to the verb to prove.

In this paper, we describe a freely available computational lexicon for Portuguese that provides mappings between verbs and their nominalizations. While Portuguese is spoken by hundreds of millions of people, in some respects it is still a resource-poor language, especially with respect to freely available resources.
The latter are particularly valuable because they enable people to build on each other's work and improve both our understanding of the language and the services that are built for the language.

Our approach is two-fold. Some of our work on building lexical resources is manual and requires detailed expert analyses of linguistic data, which is known to sometimes be boring and frustratingly time-consuming. At the same time, some of the resources that we would like to provide already exist in other languages, and so a certain amount of translation can get us quite far. In particular, we considered the English NOMLEX (Macleod et al., 1998), from which the name of our resource, NomLex-PT (for Portuguese Nominalizations Lexicon), is derived, and NOMAGE (Balvet et al., 2009), a similar but more recent project for French that is based on corpus linguistics. Additionally, new cross-lingual induction algorithms are now mature enough to be applied to large crowdsourced data collections like Wiktionary, which has grown significantly in recent years. We thus set out to expand the fledgling NomLex-PT lexicon with pairs coming from Wiktionary as well as from FrameNet (Baker et al., 1998).

2. Creating the Lexicon

Our basic modus operandi has been to have two researchers independently translate and revise each other's work. We proceeded in this way for the data from the NOMLEX project and the data from the NOMAGE lexicon.

2.1. The Initial NomLex-PT Core

To quickly bootstrap the process of creating NomLex-PT, the initial data was manually translated from the freely available English NOMLEX (Macleod et al., 1998), which contains 1,025 English nominalizations. We were pleasantly surprised by how straightforward the translations of the nominalizations in NOMLEX were and how frequently they seemed to correspond to nominalizations in Portuguese. The original NOMLEX has entries formed using the suffixes -ion, -ment, -al, -er, -ee and -ing.
For these, we first tried to preserve, to the extent possible, a direct relation between the original and the translated nominalizations. Fortunately, Portuguese has corresponding suffixes for most of these cases (for example, -ion/-ção, -ment/-mento and -er/-or). This enabled a straightforward translation for around 90% of the entries. For example, we found 506 entries in NOMLEX formed via the suffix -ion and 136 formed with -ment, while the Portuguese version NomLex-PT contains 466 entries formed with -ção and 109 entries formed with -mento. Most of them also keep a strong relationship between the Portuguese and English verbal roots (e.g. construction/construção, argument/argumento).

Many nominalizations are erudite words (Su, 2011), especially those formed by suffixation processes. In Portuguese, the most frequent nominalizations are not formed by suffixation, but through zero derivation, as in the case of the words compra (buy) and luta (fight) (Rocha, 1998). As we wanted to keep our nominalizations as close as possible to the original entries, some erudite nominalizations that could be translated with a more common word in Portuguese were first translated with an erudite word, to keep the morphological pattern (for example, arbitration was translated to arbitração, despite the existence of the Portuguese form arbitragem, which seems to be the more frequent form). But in a