Technical report: OpenMaTrEx, a free, open-source hybrid data-driven machine translation system*

Pratyush Banerjee, Sandipan Dandapat, Mikel L. Forcada†, Declan Groves, Sergio Penkale, John Tinsley, Andy Way

Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland

Version: Wednesday 15th June, 2011

* This report is an extended version of a preliminary presentation of OpenMaTrEx at IceTAL (Dandapat et al., 2010), containing a more detailed description of hybridization and of the training and translation processes, and results of additional experiments.
† Permanent address: Grup Transducens, Dept. Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant, Spain

Abstract

This report describes OpenMaTrEx, a free/open-source hybrid data-driven machine translation system containing core example-based components based on the marker hypothesis. OpenMaTrEx comprises a marker-driven chunker, a collection of chunk aligners, tools to merge (“hybridise”) marker-based and statistical translation tables, two engines (a simple proof-of-concept monotone “example-based” recombination engine and a statistical decoder based on Moses), and support for automatic evaluation. It also contains support for “word packing” to improve alignment. OpenMaTrEx is a free/open-source release of basic components of MaTrEx, the Dublin City University machine translation system. The components and processes implemented in OpenMaTrEx are described in both theoretical and functional detail. Additionally, experimental results are shown in which OpenMaTrEx is compared to plain statistical machine translation on representative tasks.

1 Introduction

This report describes OpenMaTrEx, a hybrid data-driven (or corpus-based) free/open-source machine translation system containing core example-based components based on the marker hypothesis (Green, 1979). It comprises a marker-driven chunker, a collection of chunk aligners, tools to merge