Electronics 2023, 12, 288. https://doi.org/10.3390/electronics12020288 www.mdpi.com/journal/electronics

Article

Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning

Aayushi Chaudhari 1, Chintan Bhatt 2,*, Achyut Krishna 1 and Carlos M. Travieso-González 3

1 U & P U. Patel Department of Computer Engineering, Chandubhai S Patel Institute of Technology (CSPIT), CHARUSAT Campus, Charotar University of Science and Technology, Changa 388421, India
2 Department of Computer Science and Engineering, School of Technology, Pandit Deendayal Energy University, Gandhinagar 382007, India
3 Signals and Communications Department, IDeTIC, University of Las Palmas de Gran Canaria, 35001 Las Palmas, Spain
* Correspondence: chintan.bhatt@sot.pdpu.ac.in; Tel.: +91-9909953994

Abstract: Emotion recognition is a very challenging research field due to its complexity, as individual differences in cognitive-emotional cues involve a wide variety of channels, including language, expressions, and speech. If we use video as the input, we can acquire a plethora of data for analyzing human emotions. In this research, we use features derived from separately pretrained self-supervised learning models to combine the text, audio (speech), and visual data modalities. The fusion of features and representations is the biggest challenge in multimodal emotion classification research. Because of the large dimensionality of self-supervised learning features, we present a unique transformer- and attention-based fusion method for incorporating multimodal self-supervised learning features that achieved an accuracy of 86.40% for multimodal emotion classification.

Keywords: self-attention transformer; multimodality; inter-modality attention transformer; contextual emotion recognition; depth of emotional dimensionality; computer vision; real-time application
1. Introduction

Emotion recognition and sentiment analysis have recently received a lot of attention due to their numerous applications, such as in human-computer interaction, education, and healthcare robotics. The correlation between the information reflected in and transmitted by a facial expression and the person's contemporaneous emotional state is a hot topic in both customer service and education research. Earlier techniques for encoding data modalities relied on emotion-identification features such as mel-frequency cepstral coefficients (MFCCs) [1], elements of facial muscle activity, and GloVe embeddings [2]. Recent studies [3,4] have looked into the use of transfer learning approaches for extracting features from pretrained deep learning (DL) models as opposed to low-level features. The primary purpose of our research was to create contextualized representations from these extracted features using a transformer-based architecture, and then use these representations to evaluate low/high degrees of arousal and valence. The goal of our study was to extract facial expression and acoustic sound features [5] from trained DL models for supervised learning. Previous work has combined low-level and deep features rather than representing all modalities using characteristics derived from trained deep learning models [6]. In contrast to earlier research, we have used deep features taken from pretrained self-supervised learning models to represent all input modalities (audio, video, and text) [7–9]. RoBERTa [9], FAb-Net [9], and Wav2Vec [9] are three freely accessible pretrained self-supervised learning (SSL) embedding models that we used to represent text, speech, and facial expressions. Emotional indicators that are transferable across speakers,
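To make the inter-modality attention idea concrete, the sketch below projects per-modality SSL feature sequences into a shared space and lets the text modality attend to audio and video via scaled dot-product attention before pooling into one fused vector. This is only an illustrative sketch: the sequence lengths, feature dimensions (768 for RoBERTa/Wav2Vec-style embeddings, 256 for FAb-Net-style embeddings), random projection matrices, and the text-as-query choice are assumptions for demonstration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, key_feats, d_k):
    # Scaled dot-product attention: one modality's frames attend to another's.
    scores = query_feats @ key_feats.T / np.sqrt(d_k)   # (Tq, Tk)
    return softmax(scores, axis=-1) @ key_feats         # (Tq, d_k)

# Hypothetical per-modality SSL feature sequences (dims are illustrative):
text  = rng.standard_normal((20, 768))   # e.g. RoBERTa token embeddings
audio = rng.standard_normal((50, 768))   # e.g. Wav2Vec frame embeddings
video = rng.standard_normal((30, 256))   # e.g. FAb-Net face embeddings

# Project every modality into a shared d-dimensional space first.
d = 128
proj = {name: rng.standard_normal((f.shape[1], d)) / np.sqrt(f.shape[1])
        for name, f in [("text", text), ("audio", audio), ("video", video)]}
t = text  @ proj["text"]    # (20, 128)
a = audio @ proj["audio"]   # (50, 128)
v = video @ proj["video"]   # (30, 128)

# Text attends to audio and video; mean-pool and concatenate for fusion.
t2a = cross_modal_attention(t, a, d)
t2v = cross_modal_attention(t, v, d)
fused = np.concatenate([t.mean(0), t2a.mean(0), t2v.mean(0)])  # shape (3*d,)
print(fused.shape)  # (384,)
```

In a trained model the random projections would be learned linear layers and the attention would sit inside transformer blocks with multiple heads; the point here is only how heterogeneous SSL feature dimensionalities can be aligned and mixed before classification.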
Academic Editors: Fabio Mendonca, Morgado Días and Sheikh Shanawaz Mostafa

Received: 16 December 2022; Revised: 30 December 2022; Accepted: 2 January 2023; Published: 5 January 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).