Pattern Recognition Letters 000 (2018) 1–8
Sequence in sequence for video captioning
Huiyun Wang, Chongyang Gao, Yahong Han∗
School of Computer Science and Technology, Tianjin University, Tianjin 300350, China
Article history: Available online xxx
Keywords: Video captioning; Encoding; Decoding; Spatio-temporal representation

Abstract
For video captioning, the words in the caption are closely related to an overall understanding of the
video, so a suitable representation of the video is rather important for the description. To generate
more precise words, we aim to encode the video feature for the current word at each time-stamp of the
generation process. This paper proposes a new 'Sequence in Sequence' framework that encodes the
sequential frames into a spatio-temporal representation at each time-stamp to utter a word, and further
distills the most related visual content with an extra semantic loss. First, we aggregate the sequential
frames, guided by the last word, to extract the related visual content and obtain a representation with
rich spatio-temporal information. Then, to decode the aggregated representation into a precise word, we
leverage a two-layer GRU structure, where the first layer further distills useful visual content based on
an extra semantic loss and the second layer selects the correct word according to the distilled features.
Experiments on two benchmark datasets demonstrate that our method outperforms current state-of-the-art
methods on the BLEU@4, METEOR and CIDEr metrics.
© 2018 Elsevier B.V. All rights reserved.
1. Introduction
Automatically generating a natural language description for a
video, called video captioning, amounts to summarizing the
input video based on an understanding of its visual content. Widespread
applications, e.g., video indexing, human-robot interaction, and
video description for the visually impaired, may benefit greatly
from good video descriptions, so the task attracts much attention in the
computer vision community [14,31,34,35,37,41]. Due to the rich,
open-domain activities in visual content, video captioning remains
a challenging task. Inspired by the successful use of the
encoder-decoder framework in machine translation [2,11,30] and
the development of deep learning, most video captioning methods
[13,26,27,38,39] are sequence-to-sequence models based on
the encoder-decoder framework. Typically, the encoder first uses
a Convolutional Neural Network (CNN) to extract a representation
for each static frame; the frame representations are then aggregated
with a Recurrent Neural Network (RNN) to form the video
representation, and the decoder uses another RNN to generate
the natural language description.
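This generic CNN-plus-RNN pipeline can be sketched as follows. This is a minimal illustration, not the authors' model: the GRU cell, all dimensions, the random stand-in "CNN features", and the greedy decoding loop are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state n."""
    def __init__(self, input_dim, hidden_dim):
        s = 0.1
        self.Wz = rng.normal(0, s, (hidden_dim, input_dim + hidden_dim))
        self.Wr = rng.normal(0, s, (hidden_dim, input_dim + hidden_dim))
        self.Wn = rng.normal(0, s, (hidden_dim, input_dim + hidden_dim))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        n = np.tanh(self.Wn @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * n

# --- Encoder: run a GRU over per-frame CNN features ---
feat_dim, hid_dim, vocab = 2048, 256, 10000    # hypothetical sizes
enc = GRUCell(feat_dim, hid_dim)
frames = rng.normal(0, 1, (20, feat_dim))      # stand-in for CNN features of 20 frames
h = np.zeros(hid_dim)
for f in frames:
    h = enc.step(f, h)                         # final h summarizes the whole clip

# --- Decoder: a second GRU emits one word index per time step ---
emb = rng.normal(0, 0.1, (vocab, hid_dim))     # word embedding table
Wout = rng.normal(0, 0.1, (vocab, hid_dim))    # hidden state -> vocabulary logits
dec = GRUCell(hid_dim, hid_dim)
word, s, caption = 0, h, []                    # start from a <BOS> token (index 0)
for _ in range(5):
    s = dec.step(emb[word], s)
    word = int(np.argmax(Wout @ s))            # greedy choice of the next word
    caption.append(word)
```

Note that in this encode-once scheme the encoder output `h` only initializes the decoder: the visual input to the decoder never changes between time steps, which is exactly the limitation the paper targets.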
To summarize the visual content into a meaningful natural language
sentence, the captioning model must be able to represent
the sequential frames of the video as a spatio-temporal feature
with rich visual information expressing the objects, actions and
scenes for each word in the generated sentence.

∗ Corresponding author. E-mail address: yahong@tju.edu.cn (Y. Han).

To model the dynamic
temporal structure of the video, several works [3,34,39,41] encode
the CNN representations of the frames one by one in sequence
before decoding. Although the method in [34] represents
the global temporal interaction of actors and objects that
evolve over time, it ignores the local temporal structure of the
video. To address this, the method in [39] leverages a 3-D
CNN [19,21] to encode the local temporal structure and a temporal
attention mechanism to exploit the global temporal structure.
The model in [3] applies convolution operations inside the GRU-RNN
model [2] to preserve the spatial topology of the frames while
capturing temporal information. However, due to the varying
speeds of motions in videos, the methods above can only model
temporal information in short videos. The method in [41] therefore proposes
a Multirate Visual Recurrent Model that adopts multiple encoding
rates and thus obtains a multirate representation robust
to motion-speed variance in videos. To express the static and
dynamic information for sequential words, all the above methods
attempt to obtain a representation of the video that expresses
both spatial and temporal information.
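The temporal attention idea used in [39] can be sketched as a softmax reweighting of frame features by their relevance to the current decoder state. All dimensions, the projection matrix `Wa`, and the random stand-in features below are hypothetical, chosen only to illustrate the mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)
T, feat_dim, hid_dim = 20, 512, 256            # hypothetical sizes
V = rng.normal(0, 1, (T, feat_dim))            # stand-in per-frame features
h = rng.normal(0, 1, hid_dim)                  # current decoder hidden state

# Score each frame against the decoder state, then softmax-normalize.
Wa = rng.normal(0, 0.1, (feat_dim, hid_dim))   # hypothetical projection
scores = V @ (Wa @ h)                          # relevance of each frame, shape (T,)
alpha = np.exp(scores - scores.max())          # shift by max for numerical stability
alpha /= alpha.sum()                           # attention weights sum to 1

# Context vector: a per-time-step weighted summary of the visual content.
context = alpha @ V                            # shape (feat_dim,)
```

Because `h` changes at every decoding step, `context` changes with it, giving the decoder a different visual summary for each word, in contrast to the fixed representation of the basic encoder-decoder scheme.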
However, in the generation process of the sentence, these models
encode the video features once before the decoder generates the
description of the video. Thus, as the input of the decoder, the visual
representation remains the same at each time-stamp, which is unreasonable
for generating words with different meanings.
For example, in Fig. 1(a), at the time of uttering the word
'strums', 'Sequence to Sequence' models generate the wrong word
https://doi.org/10.1016/j.patrec.2018.07.024