Polyglot Machine Translation

Luis A. Leiva and Vicent Alabau **
Sciling, CPI UPV, 46022 Valencia (Spain)

Abstract. Machine Translation (MT) requires a large amount of linguistic resources, which leads current MT systems to leave unknown words untranslated. This can be annoying for end users, as they might not understand such untranslated words at all. However, most language families share a common vocabulary; therefore, this knowledge can be leveraged to produce more understandable translations, typically for "assimilation" or gisting use. Based on this observation, we propose a method that constructs polyglot translations tailored to a particular user language. Simply put, an unknown word is translated into a set of languages that relate to the user's language, and the translated word that is closest to the user's language is used as a replacement for the unknown word. Experimental results on language coverage over three language families indicate that our method may improve the usefulness of MT systems. As confirmed by a subsequent human evaluation, polyglot translations do look familiar to users, and are perceived to be easier to read and understand than translations in their related natural languages.

Keywords: Minority Languages; Machine Translation; Linguistic Coverage; Vocabulary; Human factors

1. Introduction and Related Work

In an ideal world, the diversity of languages would not be an obstacle to the transmission of knowledge and culture. To enable communication between people separated by language barriers, computers are increasingly being used to automatically convert a source language into a target language by means of machine translation (MT) technology. Computers may never fully replace human translators, but MT is far more scalable than manual translation for "assimilation" or gisting applications, since MT can automate and considerably speed up this task.
Further, for many pairs of languages, human translators do not even exist [3].

However, only 10% of the world's languages are currently covered by MT technologies [13]. The reason for such low coverage is that MT systems adopt either rule-based or data-driven approaches (or a combination of both) to the translation task, and both require fairly large collections of language resources.¹ This means that we can expect MT to work well for the more widely spoken languages, while for other, less-spoken languages, the chances of successful implementation are more remote... Or can MT systems be adapted to support any language?

* Both are corresponding authors (name@sciling.com).
** Work conducted while both authors were affiliated with the Universitat Politècnica de València.

According to Ethnologue [14], around half of the 7,105 living languages worldwide have a developed writing system, all of them being considered minority languages or, from a natural language processing perspective, under-resourced or "noncentral" languages [20]. In theory, MT systems could be deployed for all of them, but in practice the lack of resources available for most of these languages would render any such system largely unusable, since much of the text would be left untranslated. What is more, resources vary greatly even among the 10% most popular languages; and, given their enormous rate of growth and state of continuous evolution [16], even the best-equipped languages cannot be covered in their entirety by MT systems.

At best, poor language coverage leads to what is known as the out-of-vocabulary (OOV) words problem. Current MT systems usually respond to an OOV word by leaving it untranslated.
This is rather problematic for two main reasons: firstly, untranslated words may be of paramount importance to the underlying meaning of a sentence or even a paragraph, so the message can be lost; secondly, when the source language is unrelated to the user's primary (reading) language, these untranslated words are often completely undecipherable. Consequently, in the extreme case of there being no resources available for a given source language, MT systems simply cannot be built and the automatic translation of these languages becomes a near impossible task.

© IOS Press and the authors. The final publication is available at IOS Press through http://dx.doi.org/10.3233/JIFS-152533. This is a preprint for personal use only. The published paper may be subject to some form of copyright.
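
To make the replacement idea summarized in the abstract more concrete, the following is a minimal Python sketch, not the actual method evaluated in this paper. It assumes that candidate translations of each OOV word into related languages are already available, and that closeness to the user's language is given by a per-language distance table. The function names (`pick_replacement`, `polyglot_tokens`), the distance values, and the toy data are all hypothetical and serve illustration only.

```python
from typing import Dict, List, Optional, Tuple


def pick_replacement(
    candidates: Dict[str, str],
    distance_to_user: Dict[str, float],
) -> Optional[Tuple[str, str]]:
    """Return (language, word) for the candidate whose language is
    closest to the user's language, or None if no candidate is usable.

    `candidates` maps a language code to the OOV word's translation in
    that language; `distance_to_user` is an assumed, precomputed table
    (smaller value = closer to the user's language).
    """
    usable = [lang for lang in candidates if lang in distance_to_user]
    if not usable:
        return None
    best = min(usable, key=lambda lang: distance_to_user[lang])
    return best, candidates[best]


def polyglot_tokens(
    tokens: List[str],
    oov_candidates: Dict[str, Dict[str, str]],
    distance_to_user: Dict[str, float],
) -> List[str]:
    """Replace each untranslated token with its closest related-language
    translation; tokens without any candidate are left unchanged."""
    output = []
    for token in tokens:
        choice = pick_replacement(oov_candidates.get(token, {}), distance_to_user)
        output.append(choice[1] if choice else token)
    return output


# Toy example (hypothetical values): the user reads Spanish, and the
# German word "Universität" was left untranslated by the MT system.
distances = {"pt": 0.2, "it": 0.3, "fr": 0.5}
oov = {"Universität": {"fr": "université", "pt": "universidade"}}
sentence = ["la", "Universität", "es", "grande"]
result = polyglot_tokens(sentence, oov, distances)
```

In this sketch the Portuguese candidate is chosen because Portuguese is assumed to be the closest of the available languages to Spanish; a real system would have to estimate such closeness from data rather than from a hand-written table.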