Semantically Sensible Video Captioning (SSVC)

Md. Mushfiqur Rahman 1, Thasin Abedin 1, Khondokar S. S. Prottoy 1, Ayana Moshruba 1, and Fazlul Hasan Siddiqui 2

1 Islamic University of Technology, Gazipur, Bangladesh
2 Dhaka University of Engineering and Technology, Gazipur, Bangladesh

Corresponding author: Md. Mushfiqur Rahman 1
Email address: mushfiqur11@iut-dhaka.edu

ABSTRACT

Video captioning, i.e. the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is an arduous task, and considering the complexity of the problem, the results obtained in recent research are remarkable. Still, there is plenty of scope for improvement, and this paper addresses that scope with a novel solution. Most video captioning models comprise two sequential/recurrent layers: one acts as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, Semantically Sensible Video Captioning (SSVC), which modifies the context generation mechanism through two novel approaches, "stacked attention" and "spatial hard pull". To evaluate the proposed architecture, we use the BLEU scoring metric (Papineni et al., 2002) for quantitative analysis together with a human evaluation metric for qualitative analysis. We refer to this proposed human evaluation metric as the Semantic Sensibility (SS) scoring metric; the SS score overcomes the shortcomings of common automated scoring metrics. We report that the aforementioned novelties improve the performance of state-of-the-art architectures.

INTRODUCTION

After the success of Image Captioning in recent times, researchers have been interested in exploring the scope of Video Captioning.
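The encoder-decoder pattern outlined in the abstract (a video-to-context encoder followed by a context-to-caption decoder) can be illustrated with a deliberately simplified toy sketch. This is not the SSVC implementation: the mean-pooling encoder, nearest-word decoder, frame features, and vocabulary below are all hypothetical stand-ins chosen only to show the shape of the pipeline; real systems use recurrent and attention networks for both stages.

```python
# Toy sketch of the generic video-captioning pipeline: a video is split
# into frames, an encoder compresses the frame sequence into one context
# vector, and a decoder emits words from that context. All numbers and
# words here are illustrative placeholders, not data from the paper.
from typing import Dict, List


def encode(frames: List[List[float]]) -> List[float]:
    """Encoder: collapse a sequence of frame feature vectors into a
    single context vector (here: element-wise mean over time)."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / n for i in range(dim)]


def decode(context: List[float],
           vocab: Dict[str, List[float]],
           max_words: int = 3) -> List[str]:
    """Decoder: greedily pick the words whose embeddings lie closest to
    the context vector (a crude stand-in for a recurrent decoder)."""
    def dist(v: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(context, v))

    ranked = sorted(vocab, key=lambda w: dist(vocab[w]))
    return ranked[:max_words]


# Hypothetical frame features and word embeddings for illustration only.
frames = [[0.9, 0.1], [1.1, 0.1], [1.0, 0.3]]
vocab = {"dog": [1.0, 0.2], "runs": [0.9, 0.1], "ocean": [-1.0, 0.8]}
caption = decode(encode(frames), vocab, max_words=2)  # -> ["dog", "runs"]
```

The point of the sketch is the data flow (frames in, fixed-size context, words out), which is the structure the encoder and decoder networks discussed below replace with learned components.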
Video Captioning is the process of describing a video with a meaningful caption using Natural Language Processing. The core mechanism of video captioning is based on the sequence-to-sequence architecture (Gers et al., 2000): the encoder encodes the visual stream and the decoder generates the caption. Such models are capable of retaining both the spatial and the temporal information that is essential for generating semantically correct video captions. This requires the video to be split into a sequence of frames; the model takes these frames as input and generates a series of meaningful words, in the form of a caption, as output. Video captioning has many applications, for example, human-machine interaction, aid for people with visual impairments, video indexing, information retrieval, fast video retrieval, etc. Unlike image captioning, where only spatial information is required to generate captions, video captioning requires a mechanism that combines spatial information with temporal information, storing both the higher-level and the lower-level features needed to generate semantically sensible captions.

Even though progress has been rapid, several complexities in the video captioning task remain open. One of the main challenges is extracting high-level features from videos to generate a more meaningful caption, and we propose a solution to this problem. In this paper we present a novel architecture based on the work of Venugopalan et al. (2015). It combines two novel methods: a variation of dual attention (Nam et al., 2017), which we call Stacked Attention, and a new method we call Spatial Hard Pull. On the encoding side, we use a stacked sequential encoder with two bi-directional LSTM layers. The Stacked Attention network prioritizes the objects in the video layer by layer. To overcome the redundancy of similar information

arXiv:2009.07335v2 [cs.CV] 30 Oct 2020