Citation: Jabbar, M.S.; Shin, J.; Cho, J.-D. AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval. Electronics 2022, 11, 1275. https://doi.org/10.3390/electronics11081275
Academic Editor: George A. Tsihrintzis
Received: 12 March 2022
Accepted: 14 April 2022
Published: 18 April 2022
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Article
AI Ekphrasis: Multi-Modal Learning with Foundation Models
for Fine-Grained Poetry Retrieval
Muhammad Shahid Jabbar 1, Jitae Shin 1,* and Jun-Dong Cho 1,2,*
1 Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Korea; eeshahid@skku.edu
2 Department of Human ICT Convergence, Sungkyunkwan University, Suwon 16419, Korea
* Correspondence: jtshin@skku.edu (J.S.); jdcho@skku.edu (J.-D.C.)
Abstract: Artificial intelligence research in natural language processing in the context of poetry
struggles with the recognition of holistic content such as poetic symbolism, metaphor, and other
fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval
remain largely unexplored. Our recent accessibility study indicates that poetry is an effective
medium to convey visual artwork attributes for improved artwork appreciation by people with
visual impairments. We, therefore, introduce a deep learning approach for the automatic retrieval of
poetry suitable to the input images. The recent state-of-the-art CLIP model matches multi-modal
visual and text features using cosine similarity. However, it lacks shared cross-modality
attention to model fine-grained relationships. The proposed approach in this work takes
advantage of strong pre-training of the CLIP model and overcomes its limitations by introducing
shared attention parameters to better model the fine-grained relationship between both modalities.
We test and compare our proposed approach using the expertly annotated MultiM-Poem dataset,
which is considered the largest public image–poetry pair dataset for English poetry. The proposed
approach aims to solve the problems of image-based attribute recognition and automatic retrieval
for fine-grained poetic verses. The test results show that the shared attention parameters improve
fine-grained attribute recognition, and the proposed approach is a significant step towards automatic
multi-modal retrieval for improved artwork appreciation by people with visual impairments.
Keywords: image-based poetry retrieval; fine-grained attribute recognition; accessibility; multi-
modal attention; cross-encoder
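As a rough illustration only (not the authors' implementation), the CLIP-style matching the abstract refers to ranks candidate poems by the cosine similarity between an image embedding and each poem embedding. A minimal NumPy sketch, with toy four-dimensional vectors standing in for real CLIP features:

```python
import numpy as np

def cosine_retrieval(image_emb, poem_embs):
    """Rank candidate poem embeddings against one image embedding
    by cosine similarity, as in CLIP-style retrieval."""
    img = image_emb / np.linalg.norm(image_emb)
    poems = poem_embs / np.linalg.norm(poem_embs, axis=1, keepdims=True)
    sims = poems @ img            # one similarity score per poem
    ranking = np.argsort(-sims)   # indices, best match first
    return ranking, sims

# Toy embeddings; real CLIP features are 512-dimensional.
image = np.array([1.0, 0.0, 0.0, 1.0])
poems = np.array([
    [0.9, 0.1, 0.0, 1.1],  # nearly parallel to the image embedding
    [0.0, 1.0, 1.0, 0.0],  # orthogonal to it
])
ranking, sims = cosine_retrieval(image, poems)
print(ranking[0])  # → 0, the poem pointing in the image's direction
```

Because the two modalities are encoded independently and only meet at this dot product, no attention weights are shared across them, which is the limitation the proposed shared attention parameters address.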
1. Introduction
Poets often convey the sentiments, themes, and messages they intend to articulate
implicitly through poetic verses. This implicit artistic conception is a unique
feature of human-authored poetry as opposed to machine-generated poetry. Additionally,
metaphor and symbolism are commonly employed in this type of poetry. The message
and feelings are therefore characterized by symbolism, scenes, metaphor, activities,
objects, and color tones, rather than by the literal objects or color tones of an
image alone. Existing solutions in image–text retrieval mainly focus on matching the
objects in an image against verbal descriptions of those objects, through image
captioning or training on image captioning datasets. As a result, two candidate poems
that carry the same notion but express it differently may receive distant retrieval
rankings, and vice versa, because the ranking follows a word-matching intuition [1].
Fine-grained artwork and poetry-attribute recognition requires extensive domain
knowledge, and proper feature learning is therefore a herculean task for conventional
and classical CNN-based methods. Visually impaired visitors
experience visual artwork appreciation limitations, such as a lack of sensory and cognitive
access to exhibited artworks or replicas. The visual artwork appreciation opportunities