Using Parallel Corpora for Word Sense Disambiguation

Els Lefever a,b, Véronique Hoste a,b,c, Martine De Cock b

a LT3, Language and Translation Technology Team, University College Ghent
b Dept. of Applied Mathematics and Computer Science, Ghent University
c Dept. of Linguistics, Ghent University

* The full version of this paper is published in the Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, Oregon, June 19-24, 2011.

1 Introduction

Word Sense Disambiguation (WSD) is the Natural Language Processing (NLP) task of selecting the correct sense of a polysemous word in a given context. Most state-of-the-art WSD systems are supervised classifiers trained on manually sense-tagged corpora, which are very time-consuming and expensive to build. To overcome this acquisition bottleneck (sense-tagged corpora are scarce for languages other than English), we take a multilingual approach to WSD that builds the sense inventory on the basis of the Europarl parallel corpus [3]. Using translations from a parallel corpus implicitly deals with the granularity problem, since finer sense distinctions are only relevant insofar as they are lexicalized in the target translations. It also facilitates the integration of WSD into multilingual applications such as multilingual Information Retrieval (IR) and Machine Translation (MT).

2 Experimental Setup

The starting point of the experiments was the six-lingual sentence-aligned Europarl corpus that was used in the SemEval-2010 "Cross-Lingual Word Sense Disambiguation" (CLWSD) task [4]. The task is a lexical sample task for twenty ambiguous English nouns: for an ambiguous focus word in a given context, a correct translation must be assigned in each of the five supported target languages (viz. French, Italian, Spanish, German and Dutch).
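The idea that translations implicitly define the sense inventory can be illustrated with a small sketch. The sentences and French translations below are invented for illustration and are not taken from the task data: where French lexicalizes two senses of an English noun with different words, those words serve directly as sense labels.

```python
# Invented illustration: two senses of the English noun "coach" are
# lexicalized differently in French, so the French translations act
# as the sense inventory for this word.
examples = [
    ("The coach put his best players on the field.", "entraîneur"),
    ("We travelled to Paris by coach.", "autocar"),
]

# The sense inventory is simply the set of distinct translations.
senses = {label for _, label in examples}
```

No manual sense annotation is needed: the distinction only exists in the inventory because the target language lexicalizes it.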
To detect the relevant translations for each of the twenty ambiguous focus words, we ran GIZA++ [5] with its default settings for all focus words. The word alignment output was then taken as the label for the training instances of the corresponding classifier (e.g. the Dutch translation is the label used to train the Dutch classifier). By treating this word alignment output as oracle information, we redefined the CLWSD task as a classification task.

To train our five classifiers (English as input language and French, German, Dutch, Italian and Spanish as target languages), we used the memory-based learning (MBL) algorithm implemented in TiMBL [1], which has been successfully deployed in previous WSD classification tasks [2].

For feature vector creation, we combined a set of English local context features with a set of binary bag-of-words features extracted from the aligned translations. First, all English sentences were preprocessed by means of a memory-based shallow parser (MBSP) [1] that performs tokenization, Part-of-Speech tagging and text chunking. The preprocessed sentences were used as input to build a set of commonly used WSD features for the English input sentence: (a) features related to the focus word itself, namely its word form, lemma, Part-of-Speech and chunk information, and (b) local context features for a window of three words preceding and following the focus word, containing for each of these words the full form, lemma, Part-of-Speech and chunk information.
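The step from word alignment to classification labels can be sketched as follows. This is a simplified sketch with invented data, not GIZA++'s actual output format: we assume the alignment is available as a list of (source index, target index) pairs, and the target word aligned to the focus word becomes the class label for that training instance.

```python
# Hedged sketch: turning word-alignment output into oracle labels.
# The data and the (src_idx, tgt_idx) pair representation are
# assumptions for illustration, not the GIZA++ file format.

def label_from_alignment(tgt_tokens, alignment, focus_idx):
    """Return the target word aligned to the source focus word,
    or None if the focus word is unaligned."""
    aligned = [tgt_tokens[t] for s, t in alignment if s == focus_idx]
    return aligned[0] if aligned else None

src = ["we", "took", "the", "coach", "to", "Paris"]
tgt = ["nous", "avons", "pris", "l'", "autocar", "pour", "Paris"]
align = [(0, 0), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]

label = label_from_alignment(tgt, align, 3)
```

Here the French word aligned to the focus word "coach" (index 3) is "autocar", which becomes the training label for the French classifier on this instance.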
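The local context features described above can be sketched in a few lines. The token representation (word form, lemma, PoS tag, chunk tag) and the padding symbol for sentence boundaries are assumptions for illustration; the paper's actual implementation uses the MBSP output directly.

```python
# Hedged sketch of the English local context features: the focus
# word's own features plus a window of three words on each side,
# each contributing form, lemma, Part-of-Speech and chunk tag.

PAD = ("_", "_", "_", "_")  # assumed padding for sentence boundaries

def extract_features(tokens, focus_idx, window=3):
    """tokens: list of (word, lemma, pos, chunk) tuples.
    Returns (a) the focus-word features and (b) the features of the
    `window` tokens preceding and following the focus word."""
    padded = [PAD] * window + tokens + [PAD] * window
    centre = focus_idx + window
    feats = list(padded[centre])          # (a) focus word features
    for i in range(centre - window, centre + window + 1):
        if i == centre:
            continue
        feats.extend(padded[i])           # (b) local context features
    return feats

sent = [("the", "the", "DT", "B-NP"),
        ("coach", "coach", "NN", "I-NP"),
        ("left", "leave", "VBD", "B-VP")]
vec = extract_features(sent, 1)
# 4 focus features + 6 context tokens x 4 features = 28 values
```

The binary bag-of-words features from the aligned translations would then be appended to this local context vector to form the full training instance.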