Proceedings of the Sixth Conference on Machine Translation (WMT), pages 439–445
November 10–11, 2021. ©2021 Association for Computational Linguistics
TenTrans Large-Scale Multilingual Machine Translation System for WMT21

Wanying Xie 1,2, Bojie Hu 1, Han Yang 1, Dong Yu 2, Qi Ju 1*
1 TencentMT Oteam, China
2 Beijing Language and Culture University, China
xiewanying07@gmail.com, yudong@blcu.edu.cn
{bojiehu, sharryyang, damonju}@tencent.com
Abstract
This paper describes the TenTrans large-scale multilingual machine translation system for WMT 2021. We participate in Small Track 2, which covers five South East Asian languages plus English in thirty translation directions: Javanese, Indonesian, Malay, Tagalog, Tamil, and English. We mainly utilize forward/back-translation, in-domain data selection, knowledge distillation, and gradual fine-tuning from the pre-trained FLORES-101 model. We find that forward/back-translation significantly improves the translation results, and that data selection and gradual fine-tuning are particularly effective for domain adaptation, while knowledge distillation brings only a slight performance improvement. In addition, we use model averaging to further improve translation performance on top of these systems. Our final system achieves an average BLEU score of 28.89 across the thirty directions on the test set.
1 Introduction
We participate in the WMT 2021 large-scale multilingual machine translation task, Small Track 2, in six languages: English, Indonesian, Javanese, Malay, Tamil, and Tagalog (briefly, En, Id, Jv, Ms, Ta, Tl). Translating between any two of these languages yields a total of 30 directions: English↔Indonesian, English↔Javanese, English↔Malay, English↔Tamil, English↔Tagalog, Indonesian↔Javanese, Indonesian↔Malay, Indonesian↔Tamil, Indonesian↔Tagalog, Javanese↔Malay, Javanese↔Tamil, Javanese↔Tagalog, Malay↔Tamil, Malay↔Tagalog, and Tamil↔Tagalog. To meet the data restrictions of the task, our systems are all built with constrained data sets. For all systems, we adopt a universal encoder-decoder architecture that shares parameters across all languages (Johnson et al., 2017).

* Corresponding author: Qi Ju. Our code, data, and model can be obtained at https://github.com/TenTrans/TenTrans
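In this shared-parameter setup, a single model serves all 30 directions; following Johnson et al. (2017), the target language is typically indicated by prepending a special token to the source sentence. The sketch below illustrates this preprocessing step; the tag format (`__en__`, `__id__`, ...) and function names are our assumptions for illustration, not the exact TenTrans implementation.

```python
# Sketch of target-language tagging for a shared multilingual NMT model.
# The tag format (__en__, __id__, ...) is an illustrative assumption.
LANGS = ["en", "id", "jv", "ms", "ta", "tl"]

def tag_source(sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared encoder-decoder
    can be steered to translate into any of the six languages."""
    assert tgt_lang in LANGS, f"unknown language code: {tgt_lang}"
    return f"__{tgt_lang}__ {sentence}"

# All ordered language pairs: 6 languages * 5 targets each = 30 directions.
directions = [(s, t) for s in LANGS for t in LANGS if s != t]

print(tag_source("Good morning", "id"))  # __id__ Good morning
print(len(directions))                   # 30
```

With this tagging scheme, the same English source sentence can be routed to any of the five other target languages simply by changing the prepended token, which is what allows one set of parameters to cover all 30 directions.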
Our systems are built on several techniques. We experiment with base and deeper Transformer (Vaswani et al., 2017) architectures to obtain reliable baselines, and fine-tune the pre-trained FLORES-101 model (Goyal et al., 2021) to further improve the baseline system. Moreover, we generate pseudo bilingual sentences from large-scale monolingual data, apply sequence-level knowledge distillation (Kim and Rush, 2016) to a subset of language pairs, and adopt a more effective fine-tuning strategy for domain adaptation (Gu et al., 2021). We pay particular attention to the language pairs with the weakest translation quality and improve their performance specifically. All of these techniques improve our systems, with data selection and gradual fine-tuning being especially effective. On closer analysis of this strategy, we find that the main gain likely comes from in-domain knowledge adaptation.
This paper is structured as follows: Section 2 describes the data sets. Section 3 presents a detailed overview of our systems. The experimental settings and main results are shown in Section 4. Finally, we conclude our work in Section 5.
2 Data Preparation
We use the FLORES-101 SentencePiece (SPM)¹ tokenizer model with 256K tokens to tokenize bitext and monolingual sentences². Since it is important to clean data strictly (Wang et al., 2018), we follow the m2m-100 data preprocessing procedures³ to filter bitext data. The rules are as follows:
• Remove sentences with more than 50% punctuation.
¹ https://github.com/google/sentencepiece
² https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
³ https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
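As an illustration, the punctuation-ratio rule above can be sketched as follows. The function names and the exact punctuation character set are our assumptions for illustration; the actual m2m-100 filtering scripts may differ in detail.

```python
import string

# Hypothetical helper illustrating the punctuation-ratio filtering rule:
# drop bitext pairs in which more than 50% of characters are punctuation.
PUNCT = set(string.punctuation)

def punct_ratio(sentence: str) -> float:
    """Fraction of non-space characters that are punctuation."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0  # treat an empty sentence as all punctuation
    return sum(c in PUNCT for c in chars) / len(chars)

def keep_pair(src: str, tgt: str, threshold: float = 0.5) -> bool:
    """Keep a bitext pair only if both sides pass the punctuation check."""
    return punct_ratio(src) <= threshold and punct_ratio(tgt) <= threshold

print(keep_pair("Selamat pagi!", "Good morning!"))  # True
print(keep_pair("!!! ??? ...", "Good morning!"))    # False
```

Applying such a check to both sides of each sentence pair discards noisy lines (e.g. rows of symbols or decorative punctuation) that would otherwise pollute the training bitext.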