Statistical Machine Translation
for Bilingually Low-Resource Scenarios:
A Round-Tripping Approach
Benyamin Ahmadnia
Autonomous University of Barcelona
Cerdanyola del Valles, Spain
benyamin.ahmadnia@uab.cat
Gholamreza Haffari
Monash University
Clayton, VIC, Australia
gholamreza.haffari@monash.edu
Javier Serrano
Autonomous University of Barcelona
Cerdanyola del Valles, Spain
javier.serrano@uab.cat
Abstract—In this paper we apply the round-tripping algorithm to Statistical Machine Translation (SMT) to make effective use of monolingual data and tackle training data scarcity. In this approach, the outbound-trip (forward) and inbound-trip (backward) translation tasks form a closed loop and produce informative feedback for training the translation models. Based on this feedback, we iteratively update the forward and backward translation models. Experimental results show that translation quality is improved on the Persian↔Spanish translation task.
Index Terms—natural language processing, statistical machine
translation, low-resource languages, round-tripping algorithm
I. INTRODUCTION
Statistical Machine Translation (SMT) is an approach to MT
that is characterized by the use of machine learning methods.
The goal of SMT is to translate source-language sentences into a target language by assessing the plausibility of the source and target sentences in relation to existing bodies of translation between the two languages.
SMT systems usually rely on aligned parallel training corpora, and their performance depends on the availability of sizable bodies of such data. However, collecting parallel data is very expensive in practice. As a result, bilingual parallel data is limited for many language pairs, e.g. Persian-Spanish.
Assuming the availability of large amounts of monolingual
data, it is natural to leverage them to boost the performance
of SMT systems. Different methods have been proposed for
this purpose, which roughly fall into two categories. In the first category, monolingual corpora in the
target language are used to train a language model, which is
then integrated with the SMT models trained from parallel
bilingual corpora to improve the translation quality. In the
second category, pseudo bilingual sentence pairs are gener-
ated from monolingual data by using the translation model
trained from aligned parallel corpora, and then these pseudo
bilingual sentence pairs are used to enlarge the training data
for subsequent learning iterations.
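The second category can be sketched with a toy example. The word-level dictionary "model" and the sample sentences below are illustrative stand-ins only, not the actual SMT models used in this work:

```python
# Toy sketch of pseudo-parallel data generation: pair monolingual
# source sentences with their machine translations to enlarge the
# training corpus. A real system would use a full SMT model here.

def translate(sentence, model):
    """Translate word-by-word; unknown words pass through unchanged (OOV)."""
    return " ".join(model.get(w, w) for w in sentence.split())

def make_pseudo_parallel(mono_source, model):
    """Pair each monolingual source sentence with its model translation."""
    return [(s, translate(s, model)) for s in mono_source]

# Hypothetical tiny Spanish->English lexicon, for illustration only.
toy_model = {"el": "the", "gato": "cat", "perro": "dog"}
mono = ["el gato", "el perro"]
pseudo = make_pseudo_parallel(mono, toy_model)
# pseudo == [("el gato", "the cat"), ("el perro", "the dog")]
```

Note that the pseudo pairs inherit every error the model makes, which is exactly the quality concern raised below.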
While the aforementioned methods could improve the SMT
performance to some extent, they still suffer from certain
limitations. The methods in the first category only use the
monolingual data to train language models, but do not fun-
damentally address the shortage of parallel training data.
Although the methods in the second category can enlarge the
parallel training data, there is no guarantee on the quality of
the pseudo bilingual sentence pairs.
The round-tripping algorithm is inspired by the observa-
tion that there are two related translation tasks: source-to-
target direction (outbound-trip), and target-to-source direction
(inbound-trip). These outbound and inbound trips have two significant traits: they form a closed loop, and they generate informative feedback with which the two translation models can be trained simultaneously.
In the case of the Persian↔Spanish minimal parallel-resource language pair, we apply the round-trip training algorithm to
leverage monolingual data in a more effective way. By using
this algorithm, the monolingual data can play a similar role to
the parallel bilingual data, and significantly reduce the require-
ment on parallel bilingual data. More specifically, each model
provides guidance to the other throughout the learning process.
The two models are iteratively updated until convergence in this co-training-style learning algorithm [1].
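The closed-loop training described above can be sketched as follows. The dictionary-based models, the word-overlap reconstruction score, and the filtering threshold are simplifying assumptions standing in for the actual SMT models and their feedback signal:

```python
# Sketch of one round-trip pass: translate a monolingual source
# sentence outbound, translate it back inbound, and use how well the
# source is reconstructed as feedback on the pseudo pair's quality.

def translate(sentence, model):
    """Word-by-word translation; unknown words pass through (OOV)."""
    return " ".join(model.get(w, w) for w in sentence.split())

def reconstruction_score(original, round_trip):
    """Fraction of original words recovered after the closed loop."""
    orig, back = original.split(), round_trip.split()
    matches = sum(a == b for a, b in zip(orig, back))
    return matches / max(len(orig), 1)

def round_trip_filter(mono_source, fwd, bwd, threshold=0.75):
    """Keep pseudo pairs whose inbound trip reconstructs the source well."""
    kept = []
    for s in mono_source:
        t = translate(s, fwd)        # outbound trip: source -> target
        s_back = translate(t, bwd)   # inbound trip: target -> source
        if reconstruction_score(s, s_back) >= threshold:
            kept.append((s, t))
    return kept

# Hypothetical toy lexicons; "sun" is deliberately missing from the
# backward model so its round trip fails the reconstruction check.
fwd = {"el": "the", "gato": "cat", "sol": "sun"}
bwd = {"the": "el", "cat": "gato"}
kept = round_trip_filter(["el gato", "el sol"], fwd, bwd)
# kept == [("el gato", "the cat")]
```

In the full algorithm the surviving feedback is used to re-estimate both models and the loop repeats until convergence, with each model guiding the other.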
II. LANGUAGE ISSUE
SMT has proven to be successful for a number of lan-
guage pairs. However, as soon as the Persian language is
involved with any sort of MT, a number of difficulties are
encountered. Persian suffers significantly from the shortage of
digitally available text, both parallel and monolingual. Persian
is morphologically rich, with many characteristics not shared
by other languages. It makes no use of articles ("a", "an", "the"), there is no distinction between capital and lowercase letters, and symbols and abbreviations are rarely used.
Sentence structure is also different from that of English.
Persian places parts of speech such as nouns, subjects, adverbs, and verbs in different locations in the sentence, and sometimes even omits them altogether. Some Persian words have many different spellings, and it is not uncommon for translators to invent new words. This can result in Out-Of-Vocabulary (OOV) words.
978-1-5386-4385-3/18/$31.00 ©2018 IEEE