Text-Translation Alignment: Three Languages Are Better Than Two *

Michel Simard
Laboratoire de recherche appliquée en linguistique informatique (RALI)
Université de Montréal
SimardM@IRO.UMontreal.CA

* This research was funded by the Canadian Department of Foreign Affairs and International Trade (http://www.dfait-maeci.gc.ca/), via the Agence de la francophonie (http://www.francophonie.org).

Abstract

In this article, we show how a bilingual text-translation alignment method can be adapted to deal with more than two versions of a text. Experiments on a trilingual corpus demonstrate that this method yields better bilingual alignments than can be obtained with bilingual text-alignment methods. Moreover, for a given number of texts, the computational complexity of the multilingual method is the same as for bilingual alignment.

Introduction

While bilingual text corpora have been part of the computational linguistics scene for over ten years now, we have recently witnessed the appearance of text corpora containing versions of texts in three or more languages, such as those developed within the CRATER (McEnery et al., 1997), MULTEXT (Ide and Véronis, 1994) and MULTEXT-EAST (Erjavec and Ide, 1998) projects. Access to this type of corpora raises a number of questions: Do they make new applications possible? Can methods developed for handling bilingual texts be applied to multilingual texts? More generally: is there anything to gain in viewing multilingual documents as more than just multiple pairs of translations?

Bilingual alignments have so far shown that they can play multiple roles in a wide range of linguistic applications, such as computer-assisted translation (Isabelle et al., 1993; Brown et al., 1990), terminology (Dagan and Church, 1994), lexicography (Langlois, 1996; Klavans and Tzoukermann, 1995; Melamed, 1996), and cross-language information retrieval (Nie et al., 1998).
However, the case for trilingual and multilingual alignments is not as clear. True multilingual resources such as multilingual glossaries are not widely used, and when such resources exist, their real purpose is usually to provide bilingual resources for multiple pairs of languages in a compact way.

What we intend to show here is that while multilingual correspondences may not be interesting in themselves, multilingual text alignment techniques can be useful as a means of extracting information on bilingual correspondences. Our idea is that each additional version of a text should be viewed as valuable information that can be used to produce better alignments. In other words: whatever the intended application, three languages are better than two (and, more generally: the more languages, the merrier!).

After going through some definitions and preliminary material (Section 1), we present a general method for aligning three versions of a text (Section 2). We then describe some experiments that were carried out to evaluate this approach (Section 3) and various possible optimizations (Section 4). Finally, we report on some disturbing experiments (Section 5), and conclude with directions for future work.

1 Trilingual Alignments

There are various ways in which the concept of alignment can be formalized. Here, we choose to view alignments as mathematical relations between linguistic entities. Given two texts, $A$ and $B$, seen as sets of linguistic units, $A = \{a_1, a_2, \ldots, a_m\}$ and $B = \{b_1, b_2, \ldots, b_n\}$, we define a binary alignment $X_{AB}$ as a relation on $A \cup B$:

$$X_{AB} = \{(a_1, b_1), (a_2, b_2), (a_2, b_3), \ldots\}$$
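To make the definition above concrete, here is a minimal Python sketch of a binary alignment stored as a set of pairs. The unit labels (`"a1"`, `"b1"`, ...) and the helper names `aligned_with` and `is_relation_on` are illustrative assumptions, not part of the method described in this paper.

```python
# Illustrative sketch only: unit labels and helper names are assumptions,
# not taken from the paper.
from typing import Set, Tuple

Unit = str
Alignment = Set[Tuple[Unit, Unit]]

# Two texts seen as sets of linguistic units (for example, sentences).
A = {"a1", "a2", "a3"}
B = {"b1", "b2", "b3"}

# A binary alignment X_AB as a set of pairs, mirroring the example above:
# a1 corresponds to b1, and a2 corresponds to both b2 and b3.
X_AB: Alignment = {("a1", "b1"), ("a2", "b2"), ("a2", "b3")}


def aligned_with(unit: Unit, alignment: Alignment) -> Set[Unit]:
    """Return every unit paired with `unit`, checking both sides of each pair."""
    partners = set()
    for left, right in alignment:
        if left == unit:
            partners.add(right)
        elif right == unit:
            partners.add(left)
    return partners


def is_relation_on(alignment: Alignment, units: Set[Unit]) -> bool:
    """Check that every unit appearing in a pair belongs to `units`."""
    return all(left in units and right in units for left, right in alignment)


if __name__ == "__main__":
    print(sorted(aligned_with("a2", X_AB)))  # units of B aligned with a2
    print(is_relation_on(X_AB, A | B))       # X_AB only mentions units of A or B
```

Storing the alignment as a plain set of pairs keeps the representation symmetric with respect to the two texts: a unit can be looked up from either side, and one-to-many correspondences such as that of `a2` need no special treatment.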