Many Heads but One Brain: an Overview of Fusion Brain Challenge on AI Journey 2021

Daria Bakshandaeva 1,*, Denis Dimitrov 1,3,*, Alex Shonenkov 1, Mark Potanin 1, Vladimir Arkhipkin 1, Denis Karachev 1, Vera Davydova 1, Anton Voronov 2, Mikhail Martynov 1, Natalia Semenova 1, Mikhail Stepnov 1, Elena Tutubalina 1, Andrey Chertok 1,2, Aleksandr Petiushko 2,3

1 Sber AI, 2 Artificial Intelligence Research Institute, 3 Lomonosov Moscow State University, Moscow, Russia

{DDBakshandaeva,Dimitrov.D.V,AVShonenkov,potanin.m.st}@sberbank.ru, arkhipkin.v98@gmail.com, denis.karachev@ocrv.ru, VFeDavydova@sberbank.ru, Voronov@airi.net, {mmmartynov,NAlSemenova,mistepnov,EVTutubalina}@sberbank.ru, {Chertok,Petiushko}@airi.net

Abstract—Supporting the current trend in the AI community, we propose the AI Journey 2021 Challenge called Fusion Brain, which aims to make a universal architecture process different modalities (namely, images, texts, and code) and solve multiple tasks for vision and language. The Fusion Brain Challenge (https://github.com/sberbank-ai/fusion_brain_aij2021) combines the following specific tasks: Code2code Translation, Handwritten Text Recognition, Zero-shot Object Detection, and Visual Question Answering. We have created a dataset for each task to test the participants’ submissions. Moreover, we have released a new handwritten dataset in both Russian and English, which consists of 94,128 pairs of images and texts. The Russian part of the dataset is the largest Russian handwritten dataset in the world. We also propose a baseline solution and corresponding task-specific solutions, as well as overall metrics.

Index Terms—multi-modality, multi-task, bilinguality, foundation models, fusion brain challenge
I. INTRODUCTION

A significant part of the information perceived by a person, and required for making even the simplest everyday decisions, is presented in multiple modalities, that is, with the help of different types of “input information” requiring the use of various senses and types of knowledge. Visual information requires visual perception, processing natural language texts presupposes knowledge of the language, auditory information implies the perception and analysis of sound, and so on. Each of these modalities is handled by separate, sometimes overlapping, areas of machine learning and artificial intelligence: computer vision, natural language processing, speech processing, video processing, and so on. However, a successful solution to emerging problems often cannot be obtained by analyzing data coming from only one modality, just as it is not always sufficient for a human being to use only sight or only hearing to make a rational decision. In such cases, the information required to solve the problem can be divided into several “input types”, called data modalities, all of which should be taken into consideration to make successful decisions.

*Both authors contributed equally to this research.

Fig. 1. Concept of the multi-modal and multi-task architecture Fusion Brain. The tasks here are C2C – Code2code Translation, HTR – Handwritten Text Recognition, ZsOD – Zero-shot Object Detection, VQA – Visual Question Answering, AEC – Audio Emotion Classification, and TEC – Text Emotion Classification.

Multi-task learning has a long history, mostly in the natural language processing domain. One possible reason is that, given a correct representation and thus “understanding” of a text passage, one can solve many downstream tasks: sentiment analysis, question answering, language translation, etc.
One of the most widely used approaches here is to share the lower (encoding) layers across all tasks, while the upper layers (also called “heads”) are task-specific and learned separately [29]. It is only recently that scientists have proposed to combine multi-modality and multi-task learning in one model, taking a joint approach: using different encoders for different modalities, then combining the different types of information during middle processing, and completing the pipeline with task-specific heads. An example is the UniT [20] approach, where visual and textual modalities are used, and 7 tasks from the computer vision (e.g. object detection), text processing (e.g. sentiment analysis), and vision-and-language (e.g. visual question answering) fields are solved.

The problem of training large pretrained multi-modal and multi-task models can be separated into 2 subtasks: 1) How

arXiv:2111.10974v1 [cs.CV] 22 Nov 2021
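The encoders–shared-trunk–heads pattern described above can be expressed as a minimal sketch. This is a hypothetical illustration, not the paper’s or UniT’s actual implementation: the class and parameter names are placeholders, and real systems use transformer encoders and learned fusion layers rather than these toy callables.

```python
# Sketch of a multi-modal, multi-task model: per-modality encoders,
# a shared middle ("trunk") that fuses their outputs, and one head per task.
# All components here are plain callables standing in for neural modules.

class MultiModalMultiTaskModel:
    def __init__(self, encoders, trunk, heads):
        self.encoders = encoders  # modality name -> encoder callable
        self.trunk = trunk        # shared middle processing (fusion)
        self.heads = heads        # task name -> task-specific head

    def forward(self, inputs, task):
        # 1) Encode each modality with its own encoder.
        features = [self.encoders[m](x) for m, x in inputs.items()]
        # 2) Fuse the per-modality features in the shared trunk.
        fused = self.trunk(features)
        # 3) Dispatch the fused representation to the requested task head.
        return self.heads[task](fused)

# Toy instantiation: encoders just tag their input, the trunk is identity.
model = MultiModalMultiTaskModel(
    encoders={"image": lambda x: ("img", x), "text": lambda x: ("txt", x)},
    trunk=lambda feats: feats,
    heads={"vqa": lambda f: {"answer": f},
           "detection": lambda f: {"boxes": f}},
)
out = model.forward({"image": "pixels", "text": "question"}, task="vqa")
```

The key design point is that only the head differs between tasks; the encoders and trunk are reused, which is what allows a single set of shared parameters to serve object detection, VQA, and text tasks alike.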