Statistical Machine Translation
for Bilingually Low-Resource Scenarios:
A Round-Tripping Approach
Benyamin Ahmadnia
Autonomous University of Barcelona
Cerdanyola del Valles, Spain
benyamin.ahmadnia@uab.cat
Gholamreza Haffari
Monash University
Clayton, VIC, Australia
gholamreza.haffari@monash.edu
Javier Serrano
Autonomous University of Barcelona
Cerdanyola del Valles, Spain
javier.serrano@uab.cat
Abstract—In this paper we apply the round-tripping algorithm to Statistical Machine Translation (SMT) to make effective use of monolingual data and tackle training data scarcity. In this approach, the outbound-trip (forward) and inbound-trip (backward) translation tasks form a closed loop and produce informative feedback for training the translation models. Based on this feedback, we iteratively update the forward and backward translation models. Experimental results show that translation quality is improved on the Persian↔Spanish translation task.
Index Terms—natural language processing, statistical machine
translation, low-resource languages, round-tripping algorithm
I. INTRODUCTION
Statistical Machine Translation (SMT) is an approach to MT
that is characterized by the use of machine learning methods.
The goal of SMT is to translate source-language sentences into a target language by assessing the plausibility of the source and target sentences in relation to existing bodies of translation between the two languages.
SMT systems usually rely on aligned parallel training corpora, and their performance depends on the availability of sizable bodies of such data. However, collecting parallel data is very expensive in practice. As a result, bilingual parallel data is limited for many language pairs, e.g. Persian-Spanish.
Assuming the availability of large amounts of monolingual
data, it is natural to leverage them to boost the performance
of SMT systems. Different methods have been proposed for
this purpose, which roughly fall into two categories. In the first category, monolingual corpora in the
target language are used to train a language model, which is
then integrated with the SMT models trained from parallel
bilingual corpora to improve the translation quality. In the
second category, pseudo bilingual sentence pairs are gener-
ated from monolingual data by using the translation model
trained from aligned parallel corpora, and then these pseudo
bilingual sentence pairs are used to enlarge the training data
for subsequent learning iterations.
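The second category can be sketched with a toy example. The word-level dictionary "model" and the sample sentences below are illustrative stand-ins only, not the actual SMT models used in this work:

```python
# Toy sketch of pseudo-parallel data generation: pair monolingual
# source sentences with their machine translations to enlarge the
# training corpus. A real system would use a full SMT model here.

def translate(sentence, model):
    """Translate word-by-word; unknown words pass through unchanged (OOV)."""
    return " ".join(model.get(w, w) for w in sentence.split())

def make_pseudo_parallel(mono_source, model):
    """Pair each monolingual source sentence with its model translation."""
    return [(s, translate(s, model)) for s in mono_source]

# Hypothetical tiny Spanish->English lexicon, for illustration only.
toy_model = {"el": "the", "gato": "cat", "perro": "dog"}
mono = ["el gato", "el perro"]
pseudo = make_pseudo_parallel(mono, toy_model)
# pseudo == [("el gato", "the cat"), ("el perro", "the dog")]
```

Note that the pseudo pairs inherit every error the model makes, which is exactly the quality concern raised below.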
While the aforementioned methods could improve the SMT
performance to some extent, they still suffer from certain
limitations. The methods in the first category only use the
monolingual data to train language models, but do not fun-
damentally address the shortage of parallel training data.
Although the methods in the second category can enlarge the
parallel training data, there is no guarantee on the quality of
the pseudo bilingual sentence pairs.
The round-tripping algorithm is inspired by the observa-
tion that there are two related translation tasks: source-to-
target direction (outbound-trip), and target-to-source direction
(inbound-trip). These outbound and inbound trips have two significant traits: they form a closed loop, and they generate informative feedback with which the two translation models can be trained simultaneously.
In the case of the Persian↔Spanish minimal parallel-resource language pair, we apply the round-trip training algorithm to
leverage monolingual data in a more effective way. By using
this algorithm, the monolingual data can play a similar role to
the parallel bilingual data, and significantly reduce the require-
ment on parallel bilingual data. More specifically, each model
provides guidance to the other throughout the learning process.
The two models are iteratively updated until convergence in this co-training-style learning algorithm [1].
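The closed-loop training described above can be sketched as follows. The dictionary-based models, the word-overlap reconstruction score, and the filtering threshold are simplifying assumptions standing in for the actual SMT models and their feedback signal:

```python
# Sketch of one round-trip pass: translate a monolingual source
# sentence outbound, translate it back inbound, and use how well the
# source is reconstructed as feedback on the pseudo pair's quality.

def translate(sentence, model):
    """Word-by-word translation; unknown words pass through (OOV)."""
    return " ".join(model.get(w, w) for w in sentence.split())

def reconstruction_score(original, round_trip):
    """Fraction of original words recovered after the closed loop."""
    orig, back = original.split(), round_trip.split()
    matches = sum(a == b for a, b in zip(orig, back))
    return matches / max(len(orig), 1)

def round_trip_filter(mono_source, fwd, bwd, threshold=0.75):
    """Keep pseudo pairs whose inbound trip reconstructs the source well."""
    kept = []
    for s in mono_source:
        t = translate(s, fwd)        # outbound trip: source -> target
        s_back = translate(t, bwd)   # inbound trip: target -> source
        if reconstruction_score(s, s_back) >= threshold:
            kept.append((s, t))
    return kept

# Hypothetical toy lexicons; "sun" is deliberately missing from the
# backward model so its round trip fails the reconstruction check.
fwd = {"el": "the", "gato": "cat", "sol": "sun"}
bwd = {"the": "el", "cat": "gato"}
kept = round_trip_filter(["el gato", "el sol"], fwd, bwd)
# kept == [("el gato", "the cat")]
```

In the full algorithm the surviving feedback is used to re-estimate both models and the loop repeats until convergence, with each model guiding the other.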
II. LANGUAGE ISSUE
SMT has proven to be successful for a number of lan-
guage pairs. However, as soon as the Persian language is
involved with any sort of MT, a number of difficulties are
encountered. Persian suffers significantly from the shortage of
digitally available text, both parallel and monolingual. Persian
is morphologically rich, with many characteristics not shared
by other languages. It makes no use of articles ("a", "an", "the"), there is no distinction between capital and lowercase letters, and symbols and abbreviations are rarely used.
Sentence structure is also different from that of English.
Persian places parts of speech such as nouns, subjects, adverbs, and verbs in different locations in the sentence, and sometimes even omits them altogether. Some Persian words have many different spellings, and it is not uncommon for translators to invent new words. This can result in Out-Of-Vocabulary (OOV) words.
978-1-5386-4385-3/18/$31.00 ©2018 IEEE