Pattern Recognition Letters 000 (2018) 1–8
Journal homepage: www.elsevier.com/locate/patrec

Sequence in sequence for video captioning

Huiyun Wang, Chongyang Gao, Yahong Han
School of Computer Science and Technology, Tianjin University, Tianjin 300350, China

Article history: Available online xxx
Keywords: Video captioning; Encoding; Decoding; Spatio-temporal representation

Abstract

In video captioning, the words of the caption are closely tied to an overall understanding of the video, so a suitable video representation is crucial for the description. To generate more precise words, we aim to encode the video feature for the current word at each time-stamp of the generation process. This paper proposes a new 'Sequence in Sequence' framework that encodes the sequential frames into a spatio-temporal representation at each time-stamp of uttering a word, and further distills the most related visual content through an extra semantic loss. First, we aggregate the sequential frames, guided by the last word, to extract related visual content and obtain a representation with rich spatio-temporal information. Then, to decode the aggregated representation into a precise word, we leverage a two-layer GRU structure, where the first layer further distills useful visual content under an extra semantic loss and the second layer selects the correct word according to the distilled features. Experiments on two benchmark datasets demonstrate that our method outperforms the current state-of-the-art methods on the BLEU@4, METEOR, and CIDEr metrics.

© 2018 Elsevier B.V. All rights reserved.

1. Introduction

Automatically generating a natural language description for a video, called video captioning, refers to summarizing the input video based on an understanding of its visual content.
Widespread applications, e.g., video indexing, human-robot interaction, and video description for the visually impaired, may benefit greatly from good video descriptions, so the task attracts much attention in the computer vision community [14,31,34,35,37,41]. Owing to the rich, open-domain activities in visual content, video captioning remains a challenging task. Inspired by the successful use of the encoder-decoder framework in machine translation [2,11,30] and the development of deep learning, most video captioning methods [13,26,27,38,39] are sequence-to-sequence models built on the encoder-decoder framework. Specifically, the encoder first uses a Convolutional Neural Network (CNN) to extract representations of the static frames, the per-frame representations are then aggregated by a Recurrent Neural Network (RNN) into the video representation, and the decoder uses another RNN to generate the natural language description.

To summarize the visual content into a meaningful natural language sentence, the captioning model must be able to turn the sequential frames of the video into a spatio-temporal feature with rich visual information that expresses the objects, actions, and scenes behind each word of the generated sentence.

⁎ Corresponding author. E-mail address: yahong@tju.edu.cn (Y. Han).

To model the dynamic temporal structure of the video, several works [3,34,39,41] encode the CNN representations of the frames one by one, in sequence, before decoding. Although the method in [34] represents the global temporal interaction of actors and objects as they evolve over time, it ignores the local temporal structure of the video. To address this, the method in [39] leverages a 3-D CNN [19,21] to encode the local temporal structure and a temporal attention mechanism to exploit the global temporal structure.
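The generic CNN-RNN encoder-decoder pipeline described above can be sketched as follows. This is a toy numpy illustration with random stand-in weights and made-up dimensions, not any of the cited models: real systems learn the weights, use large CNN features (e.g., 2048-d), and condition each decoding step on the previously emitted word.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes standing in for real dimensions: 8 frames, 32-d "CNN" features,
# a 16-d recurrent state, and a 10-word vocabulary.
T_FRAMES, FEAT, HID, VOCAB = 8, 32, 16, 10

def rnn_encode(frames, Wx, Wh):
    """Fold a sequence of per-frame feature vectors into one video vector
    with a simple recurrent update (stand-in for an LSTM/GRU encoder)."""
    h = np.zeros(HID)
    for x in frames:                   # one CNN feature vector per frame
        h = np.tanh(Wx @ x + Wh @ h)
    return h                           # a single, fixed video representation

def greedy_decode(h_video, Wy, steps=5):
    """Emit word ids from the fixed video vector, one per time-stamp."""
    words, h = [], h_video
    for _ in range(steps):
        logits = Wy @ h                # unnormalized word scores
        words.append(int(np.argmax(logits)))
        h = np.tanh(h)                 # toy state update between steps
    return words

frames = rng.normal(size=(T_FRAMES, FEAT))   # stand-in CNN frame features
Wx = rng.normal(size=(HID, FEAT)) * 0.1
Wh = rng.normal(size=(HID, HID)) * 0.1
Wy = rng.normal(size=(VOCAB, HID)) * 0.1

h_video = rnn_encode(frames, Wx, Wh)
caption = greedy_decode(h_video, Wy)
print(caption)
```

Note that `greedy_decode` sees only the single vector `h_video`: this is exactly the fixed-representation limitation the following paragraphs discuss.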
The model in [3] applies the convolution operation to the GRU-RNN model [2] to preserve the spatial topology of the frames while also capturing temporal information. However, due to the varying speeds of motion in videos, the methods above can only model temporal information in short videos. The method in [41] therefore proposes a Multirate Visual Recurrent Model that adopts multiple encoding rates and thus obtains a multirate representation robust to motion-speed variance in videos.

To express the static and dynamic information behind sequential words, all the above methods attempt to obtain a representation of the video that expresses its spatial and temporal information. However, in the sentence-generation process, these models encode the video features before the decoder generates the description. Thus, as the input of the decoder, the visual representation remains the same at each time-stamp, which is unreasonable for generating words with different meanings. For example, in Fig. 1(a), at the time of uttering the word 'strums', 'Sequence to Sequence' models generate the wrong word

https://doi.org/10.1016/j.patrec.2018.07.024
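The temporal-attention idea referenced above, which addresses the fixed-representation problem by letting the decoder re-weight the frame features at every time-stamp, can be sketched as below. This is a minimal soft-attention illustration with random stand-in weights and toy sizes; it follows the general form of attention scoring (tanh of projected frame feature plus projected decoder state), not the exact formulation of [39].

```python
import numpy as np

rng = np.random.default_rng(1)
T, FEAT, HID = 8, 32, 16   # assumed toy sizes: frames, feature dim, state dim

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(frame_feats, h_dec, Wa, Ua, va):
    """Soft temporal attention: score every frame against the current decoder
    state, then return the attention-weighted average of the frame features
    as this time-stamp's visual context."""
    scores = np.array([va @ np.tanh(Wa @ f + Ua @ h_dec) for f in frame_feats])
    alpha = softmax(scores)                 # weights over frames, sum to 1
    return alpha, alpha @ frame_feats       # context vector for this step

frame_feats = rng.normal(size=(T, FEAT))    # stand-in CNN frame features
Wa = rng.normal(size=(HID, FEAT)) * 0.1
Ua = rng.normal(size=(HID, HID)) * 0.1
va = rng.normal(size=HID) * 0.1

# Two different decoder states (two different words being uttered)
h1, h2 = rng.normal(size=HID), rng.normal(size=HID)
a1, c1 = attend(frame_feats, h1, Wa, Ua, va)
a2, c2 = attend(frame_feats, h2, Wa, Ua, va)
print(a1.round(3))
```

Because the context vector depends on the decoder state, `c1` and `c2` differ: unlike the fixed-representation pipeline, each word is generated from visual content selected for that time-stamp.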