Improved Statistical Translation Through Editing

Chris Callison-Burch
Colin Bannard
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW
{chris,colin}@linearb.co.uk

Josh Schroeder
Linear B Ltd.
39 B Cumberland Street
Edinburgh EH3 6RA
josh@linearb.co.uk

Abstract

In this paper we introduce Linear B’s statistical machine translation system. We describe how Linear B’s phrase-based translation models are learned from a parallel corpus, and show how the quality of the translations produced by our system can be improved over time through editing. There are two levels at which our translations can be edited. The first is through a simple correction of the text that is produced by our system. The second is through a mechanism which allows an advanced user to examine the sentences that a particular translation was learned from. The learning process can be improved by correcting which phrases in the sentence should be considered translations of each other.

1 Introduction

Statistical machine translation was first proposed in Brown et al. (1988). Since statistical machine translation systems are created by automatically analyzing a corpus of example translations they have a number of advantages over systems that are built using more traditional approaches to MT:

• They make few linguistic assumptions and can therefore be applied to nearly any language pair, given a sufficiently large corpus.

• They can be developed in a matter of weeks or days, whereas systems that are hand-crafted by linguists and lexicographers can take years.

• They can be improved with little additional effort as more data becomes available.

More recent advances in phrase-based approaches to statistical translation (Koehn et al., 2003; Marcu and Wong, 2002; Och et al., 1999) have led to a dramatic increase in the quality of the translation systems. Phrase-based translation systems produce higher-quality translation since they use longer segments of human translated text.
Using longer segments of human translated text reduces problems associated with literal word-for-word translations. For example, multi-word expressions such as idioms are better translated.

Linear B is a commercial provider of statistical machine translation systems. This paper describes Linear B’s advances in phrase-based machine translation that allow translation quality to be improved by editing the translations that our system produces. There are two levels at which our translations can be edited:

• The first is through a simple correction of the text that is produced by our system. Our system improves by dynamically learning the correct translations of new phrases. These new phrases are extracted from the corrected sentence pair using the existing translation models, and can be used immediately for subsequent translations.

• The second is through a mechanism that allows an advanced user to inspect which phrases the