Machine Translation (2021) 35:265–288
https://doi.org/10.1007/s10590-021-09276-y
MSVD‑Turkish: a comprehensive multimodal video dataset
for integrated vision and language research in Turkish
Begum Citamak¹ · Ozan Caglayan³ · Menekse Kuyu¹ · Erkut Erdem¹ ·
Aykut Erdem² · Pranava Madhyastha³ · Lucia Specia³
Received: 15 December 2020 / Accepted: 17 June 2021 / Published online: 1 July 2021
© The Author(s), under exclusive licence to Springer Nature B.V. 2021
Abstract
Automatic generation of video descriptions in natural language, also called video
captioning, aims to understand the visual content of the video and produce a natural
language sentence depicting the objects and actions in the scene. This challenging
integrated vision and language problem, however, has been predominantly
addressed for English. The lack of data and the linguistic properties of other languages
limit the success of existing approaches for such languages. In this paper we
target Turkish, a morphologically rich and agglutinative language that has very different
properties compared to English. To do so, we create the first large-scale video
captioning dataset for this language by carefully translating the English descriptions
of the videos in the MSVD (Microsoft Research Video Description Corpus) dataset
into Turkish. In addition to enabling research in video captioning in Turkish, the
parallel English–Turkish descriptions also enable the study of the role of video context
in (multimodal) machine translation. In our experiments, we build models for
both video captioning and multimodal machine translation and investigate the effect
of different word segmentation approaches and different neural architectures to better
address the properties of Turkish. We hope that the MSVD-Turkish dataset and
the results reported in this work will lead to better video captioning and multimodal
machine translation models for Turkish and other morphologically rich and agglutinative
languages.
Keywords Video description dataset · Turkish · Video captioning · Video
understanding · Neural machine translation · Multimodal machine translation
* Erkut Erdem
erkut@cs.hacettepe.edu.tr
Extended author information available on the last page of the article