Mach Translat (2012) 26:159–176
DOI 10.1007/s10590-011-9105-x

Evaluation of 2-way Iraqi Arabic–English speech translation systems using automated metrics

Sherri Condon · Mark Arehart · Dan Parvaz · Gregory Sanders · Christy Doran · John Aberdeen

Received: 12 July 2010 / Accepted: 9 August 2011 / Published online: 22 September 2011
© Springer Science+Business Media B.V. 2011

Abstract  The Defense Advanced Research Projects Agency (DARPA) Spoken Language Communication and Translation System for Tactical Use (TRANSTAC) program (http://1.usa.gov/transtac) faced many challenges in applying automated measures of translation quality to Iraqi Arabic–English speech translation dialogues. Features of speech data in general, and of Iraqi Arabic data in particular, undermine basic assumptions of automated measures that depend on matching system outputs to reference translations. These features are described along with the challenges they present for evaluating machine translation quality using automated metrics. We show that scores for translation into Iraqi Arabic exhibit higher correlations with human judgments when they are computed from normalized system outputs and reference translations. Orthographic normalization, lexical normalization, and operations involving light stemming all resulted in higher correlations with human judgments.

Approved for Public Release: 11-0118. Distribution Unlimited. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government. Some of the material in this article was originally presented at the Language Resources and Evaluation Conference (LREC) 2008 in Marrakesh, Morocco, and at MT Summit XII (2009) in Ottawa, Canada.

S. Condon (B) · M. Arehart, The MITRE Corporation, McLean, VA, USA
e-mail: scondon@mitre.org
D. Parvaz, The MITRE Corporation, Orlando, FL, USA
G. Sanders, National Institute of Standards and Technology, Gaithersburg, MD, USA
C. Doran · J. Aberdeen, The MITRE Corporation, Bedford, MA, USA
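To make the abstract's notion of orthographic normalization concrete, the sketch below shows the kind of preprocessing commonly applied to Arabic text before computing reference-matching metrics: conflating alef variants, mapping alef maqsura to yeh, and stripping diacritics and tatweel. This is an illustrative example based on standard Arabic Unicode conventions, not the actual TRANSTAC normalization pipeline; the specific character mappings chosen here are assumptions for the sketch.

```python
import re

# Conflate common Arabic orthographic variants to a canonical form.
# Illustrative only; not the TRANSTAC normalization procedure.
ALEF_VARIANTS = str.maketrans({
    "\u0622": "\u0627",  # alef with madda  -> bare alef
    "\u0623": "\u0627",  # alef, hamza above -> bare alef
    "\u0625": "\u0627",  # alef, hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura      -> yeh
})

# Short-vowel marks (harakat, U+064B-U+0652) and tatweel (U+0640)
# are usually omitted in written Iraqi Arabic, so metrics should
# not penalize their presence or absence.
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")

def normalize(text: str) -> str:
    """Orthographically normalize Arabic text before metric scoring."""
    text = DIACRITICS.sub("", text)        # strip vowel marks and tatweel
    return text.translate(ALEF_VARIANTS)   # unify letter variants
```

Normalizing both the system output and the reference translation in this way lets surface-variant spellings of the same word count as matches, which is one route to the higher metric-to-human correlations the abstract reports.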