Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 287–294, Sydney, July 2006. © 2006 Association for Computational Linguistics

Low-cost Enrichment of Spanish WordNet with Automatically Translated Glosses: Combining General and Specialized Models

Jesús Giménez and Lluís Màrquez
TALP Research Center, LSI Department
Universitat Politècnica de Catalunya
Jordi Girona Salgado 1–3, E-08034, Barcelona
{jgimenez,lluism}@lsi.upc.edu

Abstract

This paper studies the enrichment of Spanish WordNet with synset glosses automatically obtained from the English WordNet glosses using a phrase-based Statistical Machine Translation system. We construct the English-Spanish translation system from a parallel corpus of proceedings of the European Parliament, and study how to adapt statistical models to the domain of dictionary definitions. We build specialized language and translation models from a small set of parallel definitions and experiment with robust ways of combining them. A statistically significant increase in performance is obtained. The best system is finally used to generate a definition for all Spanish synsets, which are now ready for manual revision. As a complementary issue, we analyze how much in-domain data is needed to improve a system trained entirely on out-of-domain data.

1 Introduction

Statistical Machine Translation (SMT) is today a very promising approach. It allows Machine Translation (MT) systems to be built very quickly and fully automatically, with very competitive results, from nothing more than a parallel corpus of aligned sentences in the two languages involved.

In this work we approach the task of enriching Spanish WordNet with automatically translated glosses¹. The source glosses for these translations are taken from the English WordNet (Fellbaum, 1998), which is linked, at the synset level, to Spanish WordNet. This resource is available, among other sources, through the Multilingual Central Repository (MCR) developed by the MEANING project (Atserias et al., 2004).

¹ Glosses are short dictionary definitions that accompany WordNet synsets. See examples in Tables 5 and 6.

We start by empirically testing the performance of a previously developed English–Spanish SMT system, built from the large Europarl corpus² (Koehn, 2003). The first observation is that this system completely fails to translate the specific WordNet glosses, due to the large language differences between the two domains (vocabulary, style, grammar, etc.). This confirms one of the main criticisms of SMT, namely its strong domain dependence. Since parameters are estimated from a corpus in a particular domain, the performance of the system on a different domain is often much worse. This flaw of statistical and machine learning approaches is well known and has been widely described in the NLP literature for a variety of tasks (e.g., parsing, word sense disambiguation, and semantic role labeling).

² The Europarl Corpus is available at: http://people.csail.mit.edu/people/koehn/publications/europarl
³ About 10% of the 68,000 Spanish synsets contain a definition, generated without considering its English counterpart.

Fortunately, a small set of hand-developed Spanish glosses is available in MCR³. Thus, we move to a working scenario in which we introduce a small corpus of aligned translations from the specific domain of WordNet glosses. This in-domain corpus could itself be used as a source for constructing a specialized SMT system. Again, experiments show that this small corpus alone does not suffice, since it does not allow good translation parameters to be estimated. However, it is well suited for combination with the Europarl corpus, to generate combined Language and Translation