Describing Videos using Multi-modal Fusion

Qin Jin†, Jia Chen‡, Shizhe Chen†, Yifan Xiong†, Alexander Hauptmann‡
†School of Information, Renmin University of China, China
{qjin, cszhe1, xiongyf}@ruc.edu.cn
‡Language Technologies Institute, Carnegie Mellon University, USA
{jiac, alex}@cs.cmu.edu

ABSTRACT

Describing videos with natural language is one of the ultimate goals of video understanding. Video records multi-modal information including image, motion, aural, speech and so on. The MSR Video to Language Challenge provides a good opportunity to study multi-modality fusion in the captioning task. In this paper, we propose a multi-modal fusion encoder and integrate it with a text sequence decoder into an end-to-end video captioning framework. Features from the visual, aural, speech and meta modalities are fused together to represent the video contents. Long Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) are then used as the decoder to generate natural language sentences. Experimental results show the effectiveness of the multi-modal fusion encoder trained in the end-to-end framework, which achieved top performance in both the common metrics evaluation and the human evaluation.

Keywords

video description generation; multi-modal fusion; end-to-end framework

1. INTRODUCTION

It is an intriguing challenge to automatically describe videos containing complex and diverse contents with natural language sentences. The task has a wide range of applications, such as assisting blind people or improving search quality for online videos. Inspired by the recent success of image captioning [1, 2], where natural language sentences are generated to describe image content, researchers have been paying more attention to the generation of video captions.

Different from image captioning, generating video descriptions encounters two additional challenges. Firstly, video contains temporal information: semantic concepts involved in the video may evolve over time. Secondly, video consists of multiple modalities: besides visual information, videos also contain other contents such as the aural and speech modalities, which provide additional information. The diversity of groundtruth captions also reflects information from multiple modalities.

Many research efforts have been made to address the first aforementioned challenge in video description generation. Venugopalan et al. [3] use an LSTM to encode the frames of the video into a fixed-length vector and learn the encoder in an end-to-end framework. Yao et al. [4] exploit the local temporal structure underlying the video. They use a spatio-temporal convolutional neural network (3-D CNN) as action features to encode local temporal structure, and a temporal attention mechanism based on soft alignment to exploit global temporal structure.
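For intuition, soft-alignment temporal attention of the kind described above computes, at each decoding step, a weight for every frame feature and forms a weighted sum as the context vector. The following PyTorch-style sketch is only an illustration under assumed names and dimensions (feat_dim, hidden_dim, attn_dim are hypothetical), not the exact formulation of [4]:

```python
import torch
import torch.nn as nn


class SoftTemporalAttention(nn.Module):
    """Additive (soft-alignment) attention over per-frame features.
    Illustrative sketch only; names and dimensions are assumptions."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, frame_feats, decoder_state):
        # frame_feats: (batch, num_frames, feat_dim) per-frame features
        # decoder_state: (batch, hidden_dim), e.g. the decoder LSTM hidden state
        energy = torch.tanh(
            self.feat_proj(frame_feats) + self.state_proj(decoder_state).unsqueeze(1)
        )                                                # (batch, num_frames, attn_dim)
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, num_frames)
        context = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)       # (batch, feat_dim)
        return context, weights
```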
Pan et al. [5] use a hierarchical recurrent encoder to encode the frames of the video into a fixed-length representation and learn the encoder in an end-to-end framework. There are relatively few works focusing on the second challenge, the multi-modality issue. Jin et al. [6] investigate combining acoustic and visual information at the video representation level to generate video descriptions and achieve significant improvement over a visual-only baseline.

As the decoder for video caption generation, the LSTM is widely used in previous works [3, 4, 5, 6, 7]. Pan et al. [7] improve the optimization objective of the decoder: their coherence loss locally maximizes the probability of the next word given the previous words and the visual content, while their relevance loss enforces the relationship between the semantics of the sentence and the visual content by creating a visual-semantic embedding space. They jointly optimize the two losses in a unified model. Yu et al. [8] exploit a hierarchical-RNN framework including a sentence generator and a paragraph generator as the decoder. The framework models inter-sentence dependency, which enables it to generate a paragraph for a long video.

In this paper, we tackle the video description generation problem in the context of the MSR Video to Language Challenge. We mainly focus on utilizing multi-modal features from visual, audio, speech and meta-data information to improve the description performance. We propose a multi-modal fusion encoder and combine it with a text sequence decoder in an end-to-end framework. We address the following two questions in the challenge:

1. How much performance gain can additional modalities bring to visual-only systems?

2. Which specific categories can benefit from different modalities?

We examine these questions empirically by evaluating our framework on different feature combinations. Experimental results show that combining multi-modal cues can significantly improve the description performance and generate more semantically accurate and comprehensive sentences.
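To make the overall setup concrete, the following PyTorch-style sketch shows one possible end-to-end arrangement of a multi-modal fusion encoder feeding an LSTM decoder. The fusion strategy here (a simple average of per-modality projections), the modality names, and all dimensions are illustrative assumptions; the actual features and fusion encoder used in the challenge are described in the following sections.

```python
import torch
import torch.nn as nn


class MultiModalCaptioner(nn.Module):
    """Minimal sketch: fuse per-modality video features, then decode a caption
    with an LSTM. Modality names and dimensions are assumptions for illustration."""

    def __init__(self, modality_dims, fusion_dim, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        # One linear projection per modality (e.g. visual, aural, speech, meta)
        self.proj = nn.ModuleDict({
            name: nn.Linear(dim, fusion_dim) for name, dim in modality_dims.items()
        })
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.init_h = nn.Linear(fusion_dim, hidden_dim)
        self.init_c = nn.Linear(fusion_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):
        # features: dict of modality name -> (batch, dim) video-level feature
        # captions: (batch, seq_len) token ids of the groundtruth sentence
        fused = torch.stack(
            [torch.tanh(self.proj[name](feat)) for name, feat in features.items()], dim=0
        ).mean(dim=0)                                      # simple fusion: average of projections
        h0 = torch.tanh(self.init_h(fused)).unsqueeze(0)   # (1, batch, hidden_dim)
        c0 = torch.tanh(self.init_c(fused)).unsqueeze(0)
        emb = self.embed(captions)                         # (batch, seq_len, embed_dim)
        hidden, _ = self.decoder(emb, (h0, c0))
        return self.out(hidden)                            # (batch, seq_len, vocab_size) logits
```

Training such a model end-to-end would maximize the likelihood of the groundtruth caption, i.e., a word-level cross-entropy loss over the decoder outputs, so that gradients reach both the decoder and the fusion encoder.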