Proceedings of the Sixth Conference on Machine Translation (WMT), pages 439–445
November 10–11, 2021. ©2021 Association for Computational Linguistics
TenTrans Large-Scale Multilingual Machine Translation System for WMT21

Wanying Xie 1,2, Bojie Hu 1, Han Yang 1, Dong Yu 2, Qi Ju 1*
1 TencentMT Oteam, China
2 Beijing Language and Culture University, China
xiewanying07@gmail.com, yudong@blcu.edu.cn
{bojiehu, sharryyang, damonju}@tencent.com
Abstract
This paper describes the TenTrans large-scale multilingual machine translation system for WMT 2021. We participate in Small Track 2, which covers five South East Asian languages plus English in thirty translation directions: Javanese, Indonesian, Malay, Tagalog, Tamil, and English. We mainly utilize forward/back-translation, in-domain data selection, knowledge distillation, and gradual fine-tuning from the pre-trained FLORES-101 model. We find that forward/back-translation significantly improves the translation results, and that data selection and gradual fine-tuning are particularly effective for domain adaptation, while knowledge distillation brings only a slight performance improvement. In addition, we use model averaging to further improve translation performance on top of these systems. Our final system achieves an average BLEU score of 28.89 across the thirty directions on the test set.
1 Introduction
We participate in the WMT 2021 large-scale multilingual machine translation task, Small Track 2, in six languages: English, Indonesian, Javanese, Malay, Tamil, and Tagalog (briefly, En, Id, Jv, Ms, Ta, Tl). Translating between any two of these languages yields a total of 30 directions: English↔Indonesian, English↔Javanese, English↔Malay, English↔Tamil, English↔Tagalog, Indonesian↔Javanese, Indonesian↔Malay, Indonesian↔Tamil, Indonesian↔Tagalog, Javanese↔Malay, Javanese↔Tamil, Javanese↔Tagalog, Malay↔Tamil, Malay↔Tagalog, and Tamil↔Tagalog. To meet the data restrictions of the task, our systems are all built with constrained data sets. For all systems, we adopt a universal encoder-decoder architecture that shares parameters across all languages (Johnson et al., 2017).

* Corresponding author: Qi Ju. Our code, data, and model can be obtained at https://github.com/TenTrans/TenTrans
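In this shared-parameter setup, a single model serves all 30 directions; following Johnson et al. (2017), the target language is typically indicated by prepending a special token to the source sentence. The sketch below illustrates this preprocessing step; the tag format (`__en__`, `__id__`, ...) and function names are our assumptions for illustration, not the exact TenTrans implementation.

```python
# Sketch of target-language tagging for a shared multilingual NMT model.
# The tag format (__en__, __id__, ...) is an illustrative assumption.
LANGS = ["en", "id", "jv", "ms", "ta", "tl"]

def tag_source(sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so one shared encoder-decoder
    can be steered to translate into any of the six languages."""
    assert tgt_lang in LANGS, f"unknown language code: {tgt_lang}"
    return f"__{tgt_lang}__ {sentence}"

# All ordered language pairs: 6 languages * 5 targets each = 30 directions.
directions = [(s, t) for s in LANGS for t in LANGS if s != t]

print(tag_source("Good morning", "id"))  # __id__ Good morning
print(len(directions))                   # 30
```

With this tagging scheme, the same English source sentence can be routed to any of the five other target languages simply by changing the prepended token, which is what allows one set of parameters to cover all 30 directions.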
Our systems are built on several techniques. We experiment with base and deeper Transformer (Vaswani et al., 2017) architectures to obtain reliable baselines, and fine-tune the pre-trained FLORES-101 model (Goyal et al., 2021) to further improve the baseline system. Moreover, we generate pseudo bilingual sentences from large-scale monolingual data, apply sequence-level knowledge distillation (Kim and Rush, 2016) to a subset of language pairs, and adopt a more effective fine-tuning strategy for domain adaptation (Gu et al., 2021). We pay particular attention to the language pairs with the weakest translation quality and improve their performance specifically. All of these techniques improve our systems, with data selection and gradual fine-tuning being especially effective. On closer analysis of this strategy, we find that the main gain likely comes from in-domain knowledge adaptation.
This paper is structured as follows: Section 2 describes the data sets. Section 3 presents a detailed overview of our systems. The experimental settings and main results are shown in Section 4. Finally, we conclude our work in Section 5.
2 Data Preparation
We use the FLORES-101 SentencePiece (SPM)¹ tokenizer model with 256K tokens to tokenize bitext and monolingual sentences². Since it is important to clean data strictly (Wang et al., 2018), we follow the m2m-100 data preprocessing procedures³ to filter bitext data. The rules are as follows:
• Remove sentences with more than 50% punctuation.
¹ https://github.com/google/sentencepiece
² https://dl.fbaipublicfiles.com/flores101/pretrained_models/flores101_mm100_615M.tar.gz
³ https://github.com/pytorch/fairseq/tree/master/examples/m2m_100
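As an illustration, the punctuation-ratio rule above can be sketched as follows. The function names and the exact punctuation character set are our assumptions for illustration; the actual m2m-100 filtering scripts may differ in detail.

```python
import string

# Hypothetical helper illustrating the punctuation-ratio filtering rule:
# drop bitext pairs in which more than 50% of characters are punctuation.
PUNCT = set(string.punctuation)

def punct_ratio(sentence: str) -> float:
    """Fraction of non-space characters that are punctuation."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0  # treat an empty sentence as all punctuation
    return sum(c in PUNCT for c in chars) / len(chars)

def keep_pair(src: str, tgt: str, threshold: float = 0.5) -> bool:
    """Keep a bitext pair only if both sides pass the punctuation check."""
    return punct_ratio(src) <= threshold and punct_ratio(tgt) <= threshold

print(keep_pair("Selamat pagi!", "Good morning!"))  # True
print(keep_pair("!!! ??? ...", "Good morning!"))    # False
```

Applying such a check to both sides of each sentence pair discards noisy lines (e.g. rows of symbols or decorative punctuation) that would otherwise pollute the training bitext.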