Journal of Optimization of Soft Computing (JOSC)
Vol. 2, Issue 3, pp: (56-68), Autumn-2024
Journal homepage: https://sanad.iau.ir/journal/josc
Paper Type: Research paper

Transformer-based Meme-sensitive Cross-modal Sentiment Analysis Using Visual-Textual Data in Social Media

Zahra Pakdaman 1, Abbas Koochari 1*, and Arash Sharifi 1
1. Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran.

Article History:
Received: 2024/10/31
Revised: 2024/12/01
Accepted: 2024/12/08
DOI:

Abstract
Analyzing the sentiment of social media data plays a crucial role in understanding users’ intentions, opinions, and behaviors. Given the extensive diversity of published content (i.e., image, text, audio, and video), leveraging this variety can significantly enhance the accuracy of sentiment analysis models. This study introduces a novel meme-sensitive cross-modal architecture designed to analyze users’ emotions by integrating visual and textual data. The proposed approach distinguishes itself by its capability to identify memes within image datasets, an essential step in recognizing context-rich, sentiment-driven visual content. The research methodology involves detecting memes and separating them from standard images. From memes, the embedded text is extracted and combined with user-generated captions, forming a unified textual input. Advanced feature extraction techniques are then applied: a Vision Transformer (ViT) is employed to extract visual features, while an SBERT Bi-encoder is utilized to obtain meaningful textual embeddings. To address the challenges posed by high-dimensional data, Linear Discriminant Analysis (LDA) is used to reduce feature dimensionality while preserving information critical for classification. A carefully designed neural network, consisting of two fully connected layers, processes the fused feature vector to predict sentiment classes. Experimental evaluation demonstrates the effectiveness of the proposed method, which achieves up to 90% accuracy on the MVSA-Single dataset and 80% accuracy on the MVSA-Multiple dataset. These results underscore the model’s ability to outperform existing state-of-the-art approaches in cross-modal sentiment analysis. This study highlights the importance of integrating meme recognition and multi-modal feature extraction for improving sentiment analysis, paving the way for future research in this domain.

Keywords: Visual Sentiment Analysis, Textual Sentiment Analysis, Vision Transformer, LDA, SBERT Bi-encoder

*Corresponding Author’s Email Address: koochari@srbiau.ac.ir

1. Introduction
Sentiment analysis is a method for extracting users’ real feelings from the data (text, image, video, and audio) they publish on web platforms. Because text was, until recent years, the most popular and most frequently published type of content on websites and social media, textual data attracted the bulk of research attention. As a result, many methods have been proposed for text sentiment analysis, and this line of work has become rich and mature. In recent years, as image-oriented platforms such as Facebook and Instagram rose to prominence, users began publishing large amounts of image data. In this respect, the importance of analyzing the sentiment of this type of data was