Improving Statistical Word Alignments with Morpho-syntactic Transformations

Adrià de Gispert 1, Deepa Gupta 2, Maja Popović 3, Patrik Lambert 1, Jose B. Mariño 1, Marcello Federico 2, Hermann Ney 3, and Rafael Banchs 1

1 TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
2 ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
3 Lehrstuhl für Informatik 6, RWTH Aachen University, Aachen, Germany

Abstract. This paper presents a wide range of statistical word alignment experiments incorporating morphosyntactic information. By transforming the parallel corpus according to POS tagging, lemmatization or stemming, we explore which linguistic information helps reduce alignment error rates. To this end, alignments are evaluated against a human word alignment reference, aiming at an improved machine translation training scheme that eventually leads to improved SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, improvements due to introducing morphosyntactic information are larger under data scarcity, but significant improvement is also achieved in the large data task, meaning that certain linguistic knowledge is relevant even when large amounts of data are available.

1 Introduction

Word-aligned corpora are useful in a variety of fields. An obvious one is the automatic extraction of bilingual lexica and terminology [1]. Word sense disambiguation is another application [2], since ambiguities are distributed differently in different languages. Word-aligned corpora can also help in transferring language tools to new languages. In Yarowsky and Wicentowski [3], text analysis tools such as morphological analyzers or part-of-speech taggers are projected to languages where such resources do not exist.
Kuhn [4] presents a study of ways of exploiting statistical word alignment for grammar induction.

In statistical machine translation (SMT), word alignment is a crucial part of the training process. In approaches based on words [5], phrases [6] or n-grams [7], the basic translation units are indeed extracted from statistical word alignment [8]. Some syntax-based SMT systems [9] also rely on word alignment to estimate tree-to-string or tree-to-tree alignment models. Och and Ney [10] have shown that translation quality depends on word alignment quality.

T. Salakoski et al. (Eds.): FinTAL 2006, LNAI 4139, pp. 368–379, 2006. © Springer-Verlag Berlin Heidelberg 2006

In this paper we study ways of improving alignment quality through the incorporation of morpho-syntactic information. This type of information has already