Edinburgh University System Description for the 2008 NIST Machine Translation Evaluation

Philipp Koehn, Josh Schroeder and Miles Osborne
School of Informatics, University of Edinburgh

Abstract

The MT Group at Edinburgh University participated in the constrained task track of the 2008 NIST MT evaluation for three language pairs: Arabic–English, Chinese–English, and Urdu–English. We built systems using our Moses decoder, with MBR decoding, and large language models within the limits of 32-bit machines.

1 Common Setup

The basic setup is the Moses decoder (Koehn et al., 2007), which is publicly available at http://www.statmt.org/. The Moses system allows the training, tuning, and testing of statistical machine translation systems when provided with parallel corpora such as the ones made available for the NIST evaluation campaign.

From this starting point, the main challenge of the NIST evaluation is to make use of the large training corpora available for the language pairs Arabic–English and Chinese–English. We were constrained in our experimentation by our 32-bit infrastructure, which limits process size to 3 GB.

One key advantage of Moses is the use of on-disk translation models (Zens and Ney, 2007), which leaves the available RAM for the language model. Nevertheless, we are limited to 4-gram models trained on one billion words, while much more language modelling data is available. We are aware that better performance is possible (Brants et al., 2007) if more memory were available, or if more efficient use were made of the available memory.

While gains are possible with language-specific methods, we did not make use of anything besides a Chinese number translator. We note that gains are possible using Chinese sentence restructuring (Wang et al., 2007) and basic Arabic morphological preprocessing.
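To illustrate what such a number translator does, here is a minimal sketch of a Chinese numeral-to-digit converter. This is a hypothetical toy, not the component used in our systems; the name `chinese_to_int` and the simplifications (no 万 nested under 亿) are ours.

```python
# Hypothetical sketch of a Chinese number translator -- NOT the actual
# component used in the Edinburgh systems.
DIGITS = {'零': 0, '一': 1, '二': 2, '两': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}
UNITS = {'十': 10, '百': 100, '千': 1000}
BIG = {'万': 10**4, '亿': 10**8}

def chinese_to_int(s):
    """Convert a Chinese numeral string to an integer.
    Simplified: does not handle 万 nested under 亿 (e.g. 一亿二千万)."""
    total = section = number = 0
    for ch in s:
        if ch in DIGITS:
            number = DIGITS[ch]
        elif ch in UNITS:
            # a bare unit like 十 means "one of that unit"
            section += (number or 1) * UNITS[ch]
            number = 0
        elif ch in BIG:
            # close off everything accumulated so far and scale it up
            total = (total + section + number) * BIG[ch]
            section = number = 0
    return total + section + number
```

For example, `chinese_to_int("三百二十五")` yields `325`, and `chinese_to_int("一万二千")` yields `12000`. In an MT pipeline such a converter would typically be applied as a preprocessing or postprocessing rule on number tokens.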
The training regime of our systems can be summarized as follows:

• sentence length limit of 80 words
• GIZA++ training
• word alignment heuristic grow-diag-final-and
• phrase length limit of 7
• SRILM toolkit with interpolated Kneser-Ney discounting
• three separate language models trained on
  – the English side of the parallel corpus
  – the AFP part of the Gigaword corpus
  – the Xinhua part of the Gigaword corpus
• lexicalized reordering model with option msd-bidirectional-fe
• recaser trained as a monotone translation model
• weights optimized with max-BLEU training

We spent about two months (part time) on getting our systems in shape for the 2008 NIST evaluation. Note that training large systems easily takes 1–2 weeks, mostly due to the slow GIZA++ word alignment stage.

While the engineering of a system for such an evaluation mostly involves adapting existing methods to the task at hand, rather than developing new methods, it is nevertheless a crucial stress test and helped us track down some bugs, most notably the broken MBR decoding and some language modelling issues. All these improvements are included in the latest Moses release.

However, some of our latest advances, especially the use of factored translation models (Koehn and Hoang, 2007), randomized language models (Talbot and Osborne, 2007), domain adaptation methods (Koehn and Schroeder, 2007), and a better recaser, did not make it into the final systems due to the limited time for experimentation.
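The grow-diag-final-and heuristic listed in the training regime above symmetrizes the two directional GIZA++ alignments: start from their intersection, grow into adjacent union points that link a still-unaligned word, then add directional points whose source and target words are both unaligned. A minimal illustrative version (not the Moses implementation; function and variable names are ours):

```python
def grow_diag_final_and(e2f, f2e):
    """Symmetrize two directional word alignments, given as sets of
    (src, tgt) index pairs (e.g. from the two GIZA++ runs)."""
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]
    alignment = e2f & f2e               # start from the intersection
    union = e2f | f2e
    # grow-diag: repeatedly add union points adjacent to the current
    # alignment that link a so-far-unaligned source or target word
    added = True
    while added:
        added = False
        for s, t in sorted(alignment):
            for ds, dt in neighbors:
                s2, t2 = s + ds, t + dt
                if (s2, t2) in union and (s2, t2) not in alignment:
                    src_free = all(a != s2 for a, _ in alignment)
                    tgt_free = all(b != t2 for _, b in alignment)
                    if src_free or tgt_free:
                        alignment.add((s2, t2))
                        added = True
    # final-and: add remaining directional points whose source AND
    # target words are both still unaligned
    for direction in (e2f, f2e):
        for s, t in sorted(direction):
            if (all(a != s for a, _ in alignment)
                    and all(b != t for _, b in alignment)):
                alignment.add((s, t))
    return alignment
```

For instance, with `e2f = {(0, 0), (1, 1), (2, 2)}` and `f2e = {(0, 0), (1, 1), (1, 2), (4, 4)}`, the intersection seeds `(0, 0)` and `(1, 1)`, the grow-diag stage picks up the adjacent points `(1, 2)` and `(2, 2)`, and the final-and stage adds the isolated `(4, 4)`.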