Identifying bilingual Multi-Word Expressions for Statistical Machine Translation

Dhouha Bouamor 1,2,3, Nasredine Semmar 1, Pierre Zweigenbaum 2,3
1 CEA-LIST, Vision and Content Engineering Laboratory, F91191 Gif sur Yvette Cedex, France
2 LIMSI-CNRS, F-91403 Orsay France
3 Univ. Paris Sud, Orsay, France
dhouha.bouamor@cea.fr, nasredine.semmar@cea.fr, pz@limsi.fr

Abstract

Multi-Word Expressions (MWEs) represent a key issue for numerous applications in Natural Language Processing (NLP), especially for Machine Translation (MT). In this paper, we describe a strategy for detecting translation pairs of MWEs in a French-English parallel corpus. In addition, we introduce three methods for integrating the extracted bilingual MWEs into MOSES, a phrase-based Statistical Machine Translation (SMT) system. We show experimentally that these textual units can improve translation quality.

Keywords: bilingual Multi-Word Expression, Vector Space Model, Statistical Machine Translation

1. Introduction

A Multi-Word Expression (MWE) can be defined as a combination of words for which the syntactic or semantic properties of the whole expression cannot be obtained from its parts (Sag et al., 2002). Such units include collocations (cordon bleu), more or less frozen expressions (kick the bucket), named entities (New York), etc. (Sag et al., 2002; Constant et al., 2011). They are numerous and constitute a significant portion of the lexicon of any natural language. Jackendoff (1997) claims that the frequency of MWEs in a speaker’s lexicon is almost equivalent to the frequency of single words. While MWEs are easily mastered by native speakers, their interpretation poses a major challenge for NLP applications, especially those addressing semantic aspects of language.

For Statistical Machine Translation (SMT) systems, various improvements in translation quality were achieved with the emergence of phrase-based approaches (Koehn et al., 2003).
Phrases are defined simply as arbitrary n-grams, with no sophisticated linguistic motivation, that are consistently translated in a parallel corpus. In such systems, the lack of adequate processing of MWEs can affect translation quality. In fact, the literal translation of an unrecognized expression produces an erroneous and incomprehensible translation. For example, a system would suggest “way of iron” as a translation of “chemin de fer” instead of “railway”. It is therefore important to make use of a lexicon in which MWEs are handled. However, such a resource is not necessarily available for all languages, and where one exists, as described by Sagot et al. (2005), it does not cover all the MWEs of a given language.

In this paper, we consider as an MWE any non-compositional contiguous sequence belonging to one of the three classes defined by Luka et al. (2006). These classes were distinguished on the basis of their categorical properties and their degree of syntactic and semantic fixedness, and they comprise compounds, idiomatic expressions and collocations. Based on this classification, we present a method combining linguistic and statistical information to extract and align MWEs in a French-English parallel corpus aligned at the sentence level. Then, we introduce three methods for integrating the extracted bilingual MWEs into MOSES, the state-of-the-art phrase-based SMT system, and study to what extent such units can improve translation quality.

The remainder of this paper is organized as follows: the next section (section 2) describes in some detail previous work addressing the extraction of semantically equivalent translations and its applications. In section 3, we introduce a method for identifying French and English MWEs and then present, in section 4, the algorithm we implemented to acquire translation pairs of MWEs and report our evaluation results.
In section 5, three methods for integrating MWEs into an SMT system are introduced and the results obtained are discussed. Finally, we conclude and present our future work in section 6.

2. Related Work

In recent years, a number of techniques have been applied to the task of extracting bilingual MWEs from parallel corpora. Most work starts by identifying monolingual MWE candidates and then applies different alignment methods to acquire bilingual correspondences. Techniques for monolingual MWE extraction revolve around three approaches: (1) symbolic methods relying on morphosyntactic patterns (Okita et al., 2010; Dagan and Church, 1994); (2) statistical methods, which use association measures to rank MWE candidates (Vintar and Fisier, 2008); and (3) hybrid approaches combining (1) and (2) (Wu and Chang, 2004; Seretan and Wehrli, 2007; Daille, 2001; Boulaknadel et al., 2008). None of these approaches is without limitations. It is difficult to apply symbolic methods to data without syntactic annotations. Furthermore, due to corpus size, statistical measures have mostly been applied to bigrams and trigrams, and it becomes more problematic to extract MWEs of more than three words.

Concerning the alignment task, numerous approaches have already been introduced to deal with this problem.