Lexical Cohesion for Evaluation of Machine Translation at Document Level

Billy T.M. WONG, Cecilia F.K. PUN, Chunyu KIT, Jonathan J. WEBSTER
Department of Chinese, Translation and Linguistics
City University of Hong Kong
83 Tat Chee Avenue, Kowloon, Hong Kong SAR, P.R. China
{tmwong, fungkpun, ctckit, ctjjw}@cityu.edu.hk

Abstract—This paper studies how the granularity of machine translation evaluation can be extended from the sentence to the document level. While most state-of-the-art evaluation metrics operate at the sentence level, we emphasize the importance of document structure, showing that lexical cohesion, the use of cohesive devices to tie salient words across sentences together into a text, is a critical feature that distinguishes the superior quality of human translation from machine translation. An experiment shows that this feature can yield a 3-5% improvement in the correlation of automatic evaluation results with human judgments of machine translation outputs at the document level.

Keywords—machine translation evaluation; evaluation metric; lexical cohesion; text coherence

I. INTRODUCTION

Machine translation (MT) evaluation has undergone a significant evolution over the past decade, from human to automatic assessment. Various evaluation metrics have been formulated to quantify the quality of MT outputs, based on the assumption that such quality can be estimated by their textual similarity to corresponding professional human translations used as references. MT evaluation has thus been turned into a task of measuring the similarity between MT outputs and reference translations. Typical evaluation metrics include BLEU [1], based on n-gram matching; TERp [2], based on edit distance; METEOR [3], which utilizes morphological and semantic resources; and ATEC [4], which further exploits features such as word informativeness and word ordering. Nearly all evaluation metrics in use so far score MT outputs sentence by sentence, and the evaluation result for a text is usually a simple average of its sentence scores.
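This sentence-by-sentence scheme can be illustrated with a minimal sketch. The unigram-F1 scorer below is an illustrative stand-in for a similarity metric, not any of the published metrics cited above; the point is that the document score is just an average over sentences, so text structure plays no role.

```python
def sentence_score(hypothesis, reference):
    """Unigram F1 between one MT output sentence and its reference
    (an illustrative stand-in for a real sentence-level metric)."""
    hyp, ref = hypothesis.split(), reference.split()
    overlap = len(set(hyp) & set(ref))
    if not hyp or not ref or overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def document_score(hyp_sentences, ref_sentences):
    """Simple average of sentence scores: the order of sentences and
    any cohesive links between them are ignored entirely."""
    scores = [sentence_score(h, r)
              for h, r in zip(hyp_sentences, ref_sentences)]
    return sum(scores) / len(scores)
```

Shuffling the sentence pairs of a document leaves `document_score` unchanged, which is precisely the blindness to text structure discussed next.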
A drawback of this kind of sentence-based evaluation is its neglect of text structure: no attention is paid to cohesion and coherence within a text. These two linguistic features operate at the inter-sentential level and are realized through the interlinkage of lexical, grammatical and semantic elements across sentences. In the MT evaluation framework of the International Standards for Language Engineering (ISLE), coherence is defined as “the degree to which the reader can define the role of each individual sentence (or group of sentences) with respect to the text as a whole” [5]. A text assembled by simply putting together well-translated but stand-alone sentences, without considering how cohesion and coherence are realized between them, offers no such guarantee. Sentence-based evaluation metrics have no means of distinguishing whether a text is cohesive and coherent, and are therefore prone to falsely over- or underestimate the performance of an MT system. Accurate MT evaluation at the text (document) level is particularly important to MT users, who mainly care about the overall meaning of a text rather than the grammatical correctness of each sentence [6]. Accordingly, evaluation needs to take into account how the individual sentences of an MT output are joined together into a text. The connectivity of sentences is clearly a significant factor in assessing the understandability of a text as a whole. This does not mean that sentence-level and text-level evaluation are incompatible with each other; rather, both intra- and inter-sentence evaluation are important. In MT evaluation, cohesion and coherence are both monolingual features of a target text. They can hardly be evaluated in isolation and have to be considered in conjunction with other quality criteria such as adequacy and fluency.
A survey of MT post-editing [7] suggests that cohesion and coherence serve as higher-level quality criteria beyond many others such as syntactic well-formedness: post-editors tend to correct syntactic errors first, before making any amendments to improve the cohesion and coherence of an MT output. Also, as Wilks [8] noted (as cited in [9]), it is rather unlikely for a sufficiently large sample of translations to be coherent and totally wrong at the same time. Cohesion and coherence can thus serve as criteria for evaluating the overall quality of MT output. This paper studies the use of a typical type of cohesion, namely lexical cohesion, as a potential quality attribute in MT evaluation. We investigate the quantitative variance of lexical cohesion devices in MT output versus human translation, to examine how adequately or inadequately MT systems handle this feature. We have also devised a manual evaluation method to assess the coherence of MT outputs at the inter-sentence level, in support of the development of automatic evaluation methods. An experiment integrating lexical cohesion into a unigram-based MT evaluation metric confirms that this feature can bring about a significant gain in the metric’s performance, in terms of the correlation between its evaluation results and human assessment.

The research described in this paper was partially supported by City University of Hong Kong through the SRG grants 7002267 and 7008003, and by the Research Grants Council (RGC) of HKSAR, China, through the GRF grant 9041597.
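One simple way such an integration could look is sketched below. This is not the paper's actual formulation: the cohesion feature here is just the proportion of content-word tokens that repeat a word from an earlier sentence, the stopword list is a toy placeholder, and the interpolation weight `alpha` is a hypothetical parameter.

```python
# Hypothetical toy stopword list for filtering out function words.
STOPWORDS = frozenset({"the", "a", "an", "of", "to", "and", "is"})

def cohesion_ratio(sentences):
    """Fraction of content-word tokens that already appeared in an
    earlier sentence -- a crude proxy for lexical cohesion via repetition."""
    seen, repeated, total = set(), 0, 0
    for sent in sentences:
        words = [w for w in sent.lower().split() if w not in STOPWORDS]
        for w in words:
            total += 1
            if w in seen:
                repeated += 1
        seen.update(words)
    return repeated / total if total else 0.0

def combined_score(unigram_score, hyp_sentences, alpha=0.8):
    """Interpolate a sentence-based metric score for a document with
    the document-level cohesion feature (alpha is illustrative)."""
    return alpha * unigram_score + (1 - alpha) * cohesion_ratio(hyp_sentences)
```

Under a scheme of this shape, two outputs with identical sentence-level scores can receive different document-level scores when one reuses salient words across sentences and the other does not.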