Artpedia: A New Visual-Semantic Dataset with Visual and Contextual Sentences in the Artistic Domain

Matteo Stefanini [0000-0001-6153-926X], Marcella Cornia [0000-0001-9640-9385], Lorenzo Baraldi [0000-0001-5125-4957], Massimiliano Corsini [0000-0003-0543-1638], and Rita Cucchiara [0000-0002-2239-283X]

University of Modena and Reggio Emilia, Modena, Italy
{name.surname}@unimore.it

Abstract. As vision and language techniques are widely applied to realistic images, there is a growing interest in designing visual-semantic models suitable for more complex and challenging scenarios. In this paper, we address the problem of cross-modal retrieval of images and sentences coming from the artistic domain. To this aim, we collect and manually annotate the Artpedia dataset, which contains paintings and textual sentences describing both the visual content of the paintings and other contextual information. Thus, the problem is not only to match images and sentences, but also to identify which sentences actually describe the visual content of a given image. To this end, we devise a visual-semantic model that jointly addresses these two challenges by exploiting the latent alignment between visual and textual chunks. Experimental evaluations, obtained by comparing our model to different baselines, demonstrate the effectiveness of our solution and highlight the challenges of the proposed dataset. The Artpedia dataset is publicly available at: http://aimagelab.ing.unimore.it/artpedia.

Keywords: Cross-modal retrieval · Visual-semantic models · Cultural Heritage.

1 Introduction

The integration of vision and language has recently gained a lot of attention from both the computer vision and NLP communities. As humans, we can seamlessly connect what we see or imagine with what we hear or say, building effective bridges between our ability to see and our ability to express ourselves in a common language. In the effort to artificially replicate these connections, new algorithms and architectures have recently emerged for image and video captioning [1,16,5] and for visual-semantic retrieval [13,7,15]. The former combine vision and language generatively on the textual side, while the latter build common embedding spaces that integrate the two domains and allow textual elements to be retrieved from visual queries, and vice versa.

While the standard objective in visual-semantic retrieval is to associate images with visual sentences (i.e. sentences that visually describe something), the variety of sentences found in textual corpora is considerably larger and also contains sentences that do not describe the visual content of a scene. Here, we go a step beyond and extend the task of visual-semantic retrieval to a setting in which the textual