Error profiling for evaluation of machine-translated text: a Polish-English case study

Sandra Weiss (1), Lars Ahrenberg (2)
(1) Department of Culture and Communication / (2) Department of Computer and Information Science, Linköping University
E-mail: sandre17@gmail.com, lars.ahrenberg@liu.se

Abstract

We present a study of Polish-English machine translation in which the impact of various types of errors on the cohesion and comprehensibility of the translations was investigated. The following phenomena are in focus: (i) the most common errors produced by current state-of-the-art MT systems for Polish-English MT; (ii) the effect of different types of errors on text cohesion; (iii) the effect of different types of errors on readers' understanding of the translation. We found that incorrect and missing translations are the most common errors for current systems, while the category of non-translated words had the most negative impact on comprehension. All three of these categories contributed to the breaking of cohesive chains. The correlation between the number of errors found in a translation and the number of wrong answers in the comprehension tests was low. Another result was that non-native speakers of English performed at least as well as native speakers on the comprehension tests.

Keywords: Machine translation evaluation, Error analysis, Polish-English machine translation.

1. Introduction

Nowadays translation is not only a profession but an everyday activity. For quite some time now, many translation tools have been available that can be used instantly on the internet and help us access information written in a language we do not understand. In this study we wished to gauge the performance of those systems, restricted to the language pair Polish-English. The focus of the study is on the quality of the text they produce and the effect of errors on text cohesion and readers' comprehension.
Automatic metrics for machine translation output, such as BLEU, NIST and METEOR, have benefited the development and comparison of machine translation systems tremendously. They are not without drawbacks, however. They are hard to interpret in qualitative terms, and they are not really fit for task-based evaluation, as they are defined and applied independently of the intended use of the system. While some of the metrics have parameters that can be set differently, e.g. giving different weights to different n-gram lengths, they are based on comparisons with reference translations, for which the purpose and quality characteristics are usually either unknown or seen as irrelevant.

For assimilative translation, where the goal is to provide a translation that is good enough to enable a user with little knowledge of the source language to gain a correct understanding of the contents of the source text, it is hard to avoid using human subjects in the evaluations. It is also of interest, however, to know what features of a translation may cause comprehension problems. Therefore, occurrences of different types of error were investigated, and, as we hypothesized that comprehension problems may correlate with a lack of text cohesion, we also investigated the effect of errors on the cohesion of the translations and on observed difficulties of comprehension.

More specifically, we were interested in the following questions: What are the most common errors produced by current state-of-the-art MT systems for Polish-English MT? What is the effect of various types of errors on text cohesion? What is the effect of various types of errors on readers' understanding of the translation? Are there differences between native and non-native speakers in their ability to comprehend machine-translated text?

Some recent studies indicate that qualitative evaluations that employ error categories can be at least partly automated (e.g. Xiong et al., 2010; Popović and Burchardt, 2011).
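To make the n-gram comparison underlying such metrics concrete, the following is a minimal, self-contained sketch of BLEU-style sentence scoring. It is a simplification for illustration only, not the reference implementation of BLEU or of any toolkit mentioned above; the tiny smoothing constant is our own assumption, added so that a zero n-gram overlap does not make the logarithm undefined.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(hypothesis, reference, max_n=2):
    """Simplified sentence-level BLEU with uniform n-gram weights.

    Geometric mean of clipped n-gram precisions, times a brevity
    penalty for hypotheses shorter than the reference.
    """
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped overlap: each hypothesis n-gram counts at most as
        # often as it occurs in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smoothed (assumption)
    # Brevity penalty: only penalize hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(hypothesis), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()
score = bleu(hypothesis, reference)  # ≈ 0.707
```

The uniform 1/max_n exponent corresponds to the equal n-gram weights mentioned above; giving different weights to different n-gram lengths is one of the parameters that real implementations expose.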
In this study, however, the tasks of recognizing and categorizing errors have been performed by one of the authors.

The outline of the paper is as follows. In the next section we describe related work. In Section 3 we describe our method and the data used. In Section 4 we present the most important results, followed by a discussion in Section 5. Finally, in Section 6 we state our conclusions.

2. Related work

Many different techniques are available to evaluate MT output. Early accepted measures of MT evaluation included the examination of MT system output by humans, who grade the correctness of the