electronics — Article

AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval

Muhammad Shahid Jabbar 1, Jitae Shin 1,* and Jun-Dong Cho 1,2,*

1 Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon 16419, Korea; eeshahid@skku.edu
2 Department of Human ICT Convergence, Sungkyunkwan University, Suwon 16419, Korea
* Correspondence: jtshin@skku.edu (J.S.); jdcho@skku.edu (J.-D.C.)

Citation: Jabbar, M.S.; Shin, J.; Cho, J.-D. AI Ekphrasis: Multi-Modal Learning with Foundation Models for Fine-Grained Poetry Retrieval. Electronics 2022, 11, 1275. https://doi.org/10.3390/electronics11081275

Academic Editor: George A. Tsihrintzis
Received: 12 March 2022; Accepted: 14 April 2022; Published: 18 April 2022

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Abstract: Artificial intelligence research in natural language processing struggles to recognize holistic poetic content such as symbolism, metaphor, and other fine-grained attributes. Given these challenges, multi-modal image–poetry reasoning and retrieval remain largely unexplored. Our recent accessibility study indicates that poetry is an effective medium for conveying visual artwork attributes, improving artwork appreciation for people with visual impairments. We therefore introduce a deep learning approach for the automatic retrieval of poetry suited to an input image. The recent state-of-the-art CLIP model matches multi-modal visual and text features using cosine similarity.
However, it lacks shared cross-modality attention features with which to model fine-grained relationships. The approach proposed in this work takes advantage of the strong pre-training of the CLIP model and overcomes this limitation by introducing shared attention parameters that better model the fine-grained relationship between the two modalities. We test and compare our proposed approach on the expertly annotated MultiM-Poem dataset, considered the largest public image–poetry pair dataset for English poetry. The proposed approach aims to solve the problems of image-based attribute recognition and automatic retrieval of fine-grained poetic verses. The test results show that the shared attention parameters improve fine-grained attribute recognition, and that the proposed approach is a significant step towards automatic multi-modal retrieval for improved artwork appreciation by people with visual impairments.

Keywords: image-based poetry retrieval; fine-grained attribute recognition; accessibility; multi-modal attention; cross-encoder

1. Introduction

Poets often embed the sentiments, themes, and messages they intend to articulate implicitly within poetic verses. This implicit artistic conception is a unique feature of human-authored poetry, as opposed to machine-generated poetry, and metaphor and symbolism are commonly employed to achieve it. The message and feelings are therefore characterized by symbolism, scenes, metaphor, activities, objects, and color tones, rather than by the objects or color tones of an image alone. Existing solutions in image–text retrieval mainly focus on the concurrence of objects in an image with a verbal description of those objects, through image captioning or training on image-captioning datasets.
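The CLIP-style baseline discussed above ranks candidate poems purely by the cosine similarity between an image embedding and each poem embedding. The following is a minimal sketch of that matching step; the random vectors stand in for the outputs of CLIP's image and text encoders, and the function name is illustrative, not part of the CLIP API.

```python
import numpy as np

def cosine_retrieval(image_emb, poem_embs, top_k=3):
    """Rank candidate poem embeddings against one image embedding
    by cosine similarity, as in CLIP-style retrieval."""
    img = image_emb / np.linalg.norm(image_emb)
    poems = poem_embs / np.linalg.norm(poem_embs, axis=1, keepdims=True)
    sims = poems @ img                      # cosine similarity per poem
    order = np.argsort(-sims)[:top_k]       # best-matching poems first
    return order, sims[order]

# Toy stand-ins: in practice these come from CLIP's encoders (dim 512).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
poem_embs = rng.normal(size=(5, 512))
idx, scores = cosine_retrieval(image_emb, poem_embs)
```

Because the two modalities are encoded independently here, no attention is shared across them — which is exactly the limitation the proposed shared attention parameters address.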
As a result, two matching poems from the candidate poetry dataset that carry the same notion but express it differently may receive distant retrieval rankings, and vice versa, because ranking follows word-matching intuition [1]. Fine-grained artwork and poetry-attribute recognition requires extensive domain knowledge, and proper feature learning is therefore a herculean task for conventional methods and classical CNN-based methods. Visually impaired visitors face limitations in appreciating visual artwork, such as a lack of sensory and cognitive access to exhibited artworks or replicas. The visual artworks appreciation opportunities