3D Attention is All You Need

Ashutosh Baghel    Chaitanya Dwivedi    Prashasti Sar    Sanjana S. Mallya

January 7, 2020

Abstract

Visual question answering is a task that requires the interaction of two different modalities. While image question answering calls for spatial context, video question answering takes this a step further and requires temporal attention across video frames. Owing to the complex interplay of two contexts across two independent modalities, there has been relatively little work in the video question answering domain. This project uses attention across video frames in conjunction with a transformer-based cross-modal encoder architecture to handle the task at hand.

1 Introduction

Understanding visual content at a human level is the holy grail of visual intelligence. To this end, there has been a tremendous amount of research on visual question answering. However, this work mostly involves image input, and the video domain is still nascent. Among other reasons, the additional temporal aspect makes this a challenging task.

Simultaneously, transformer networks are pushing the state of the art in a range of language tasks, yet they remain relatively unexplored in the video question answering domain. Given that transformer networks enhance performance on tasks where sequential data is available, we expect a commensurate improvement in video question answering as well. Through this project, we have developed a network that follows this intuition.

2 Related Work

The baseline implementation [1], which we aim to expand on, uses a spatio-temporal visual question answering model that takes a question and a video as input and outputs either a single word or a vector of compatibility scores over answer candidates for multiple-choice questions. The architecture comprises a dual-layer video encoder and a text encoder.
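As background for the transformer-based encoders discussed throughout, the core scaled dot-product attention operation can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation; the function name and toy tensor shapes are our own.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ V                            # (n_q, d_v) attended values

# Toy self-attention: four 8-dimensional "frame" vectors attend over themselves.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```

In self-attention the queries, keys, and values all come from the same sequence, which is how each position builds a representation informed by every other position.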
The idea of using attention as a mechanism for relating different positions of a single sequence in order to compute a representation of that sequence was introduced in [5]. Since then, the Transformer model has driven the Sequence