Sign Language Translation with Transformers

Kayo Yin
École Polytechnique
kayo.yin@polytechnique.edu

ABSTRACT

Sign Language Translation (SLT) first uses a Sign Language Recognition (SLR) system to extract sign language glosses from videos. A translation system then generates spoken language translations from the sign language glosses. Though SLT has gathered interest recently, little study has been performed on the translation system. This paper focuses on the translation system and improves performance by utilizing Transformer networks. We report a wide range of experimental results for various Transformer setups and introduce the use of Spatial-Temporal Multi-Cue (STMC) networks in an end-to-end SLT system with Transformers. We perform experiments on RWTH-PHOENIX-Weather 2014T, a challenging SLT benchmark dataset of German sign language, and on ASLG-PC12, a dataset involving American Sign Language (ASL) recently used in gloss-to-text translation. Our methodology improves on the current state of the art in BLEU-4 score by over 5 points on ground truth glosses and by over 7 points on glosses predicted by an STMC network on the RWTH-PHOENIX-Weather 2014T dataset. On the ASLG-PC12 corpus, we report an improvement of over 16 points in BLEU-4. Our findings also demonstrate that end-to-end translation on predicted glosses provides even better performance than translation on ground truth glosses, which shows potential for further improvement in SLT by jointly training the SLR and translation systems or by revising the gloss annotation system.

1 Introduction

Communication holds a central position in our daily lives and social interactions. Yet, in a predominantly hearing society, the hearing-impaired are often deprived of effective communication.
Although deaf communities in various cultures have developed sign languages to communicate readily among themselves and with others who have learned to sign, it remains uncommon for hearing people to have learned sign language. While advancements have been made in recent years to better accommodate deaf people, such as the captioning of videos and the increased use of online text-based communication, the deaf community still faces issues of social isolation and miscommunication on a daily basis [55, 49, 14, 63].

Although Sign Language Recognition (SLR) has been an active topic of research over the last two decades [12, 33, 34, 8, 65], it is only in recent years that Sign Language Translation (SLT) has gathered some interest and advancement [9, 32]. For the rest of this paper, we will refer to SLT as the task of translating sign language into spoken language, and will state explicitly when we mean translation in the other direction.

In general, sign languages have developed independently of their spoken counterparts, and learning to sign is not easier than learning a completely different spoken language. There is significant linguistic variance between spoken and sign languages [59]: a sign language usually does not translate its spoken counterpart word by word. For instance, the syntax of ASL shares more with spoken Japanese than with English [42]. For this reason, SLR systems do not suffice to capture the underlying grammar and complexities of sign language, and SLT faces the additional challenge of generating translations while taking into account the different syntactic structures and grammar.

In this paper, we build upon the approach formalized in [9] for SLT, which can be divided into two parts: tokenization and translation. The tokenization problem is similar to continuous SLR, where vision methods analyze videos of sign language to generate sign language glosses that capture the meaning of the sequence of signs.
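The two-part pipeline (tokenization, then translation) can be sketched with stub components. This is only an illustrative sketch: the function names, the gloss sequence, and the toy lookup table below are hypothetical and stand in for the actual SLR and NMT systems discussed in this paper.

```python
# Illustrative sketch of the two-stage SLT pipeline.
# Stage 1 (tokenization / SLR) and stage 2 (translation) are stubbed;
# a real system would use, e.g., an STMC network for stage 1 and a
# Transformer-based NMT model for stage 2.

def recognize_glosses(video_frames):
    """Stub SLR system: predicts a gloss sequence from video.
    Here we simply return a fixed toy sequence."""
    return ["TOMORROW", "RAIN"]

def translate_glosses(glosses):
    """Stub translation system: maps glosses to a spoken-language
    sentence. The toy table illustrates that gloss order and wording
    need not match the spoken sentence word by word."""
    toy_table = {("TOMORROW", "RAIN"): "it will rain tomorrow"}
    return toy_table.get(tuple(glosses), " ".join(glosses).lower())

def sign_language_translation(video_frames):
    """End-to-end SLT: video -> glosses -> spoken language."""
    return translate_glosses(recognize_glosses(video_frames))

print(sign_language_translation(video_frames=None))  # it will rain tomorrow
```

Treating the gloss sequence as an intermediate representation is what lets the second stage be studied as an ordinary machine translation problem, independently of the vision component.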
The translation problem is analogous to any translation task between two different languages if we regard sign language glosses as one language. Recent works [43, 69] have reported improvements in the tokenization system, but there has so far been no study on improving the translation system for this SLT task. We utilize and compare different Neural Machine Translation (NMT) architectures, notably Transformers, which were not studied in [9], in the context of SLT. To evaluate the performance of our NMT approach and compare with existing works, we perform transla-