Technical report: OpenMaTrEx, a free, open-source hybrid data-driven machine translation system*

Pratyush Banerjee, Sandipan Dandapat, Mikel L. Forcada†, Declan Groves, Sergio Penkale, John Tinsley, Andy Way

Centre for Next Generation Localisation, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland

Version: Wednesday 15th June, 2011

* This report is an extended version of a preliminary presentation of OpenMaTrEx at IceTAL (Dandapat et al., 2010), containing a more detailed description of hybridization and of the training and translation processes, and results of additional experiments.
† Permanent address: Grup Transducens, Dept. Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain

Abstract

This report describes OpenMaTrEx, a free/open-source hybrid data-driven machine translation system containing core example-based components based on the marker hypothesis. OpenMaTrEx comprises a marker-driven chunker, a collection of chunk aligners, tools to merge ("hybridise") marker-based and statistical translation tables, two engines (a simple proof-of-concept monotone "example-based" recombination engine and a statistical decoder based on Moses), and support for automatic evaluation. It also contains support for "word packing" to improve alignment. OpenMaTrEx is a free/open-source release of basic components of MaTrEx, the Dublin City University machine translation system. The components and processes implemented in OpenMaTrEx are described in both theoretical and functional detail. Additionally, experimental results are shown in which OpenMaTrEx is compared to plain statistical machine translation on representative tasks.

1 Introduction

This report describes OpenMaTrEx, a hybrid data-driven (or corpus-based) free/open-source machine translation system containing core example-based components based on the marker hypothesis (Green, 1979). It comprises a marker-driven chunker, a collection of chunk aligners, tools to merge