Electronics 2023, 12, 288. https://doi.org/10.3390/electronics12020288
Article
Facial Emotion Recognition with Inter-Modality-Attention-Transformer-Based Self-Supervised Learning
Aayushi Chaudhari 1, Chintan Bhatt 2,*, Achyut Krishna 1 and Carlos M. Travieso-González 3
1 U & P U. Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Technology (CSPIT), CHARUSAT Campus, Charotar University of Science and Technology, Changa 388421, India
2 Department of Computer Science and Engineering, School of Technology, Pandit Deendayal Energy University, Gandhinagar 382007, India
3 Signals and Communications Department, IDeTIC, University of Las Palmas de Gran Canaria, 35001 Las Palmas, Spain
* Correspondence: chintan.bhatt@sot.pdpu.ac.in; Tel.: +91-9909953994
Abstract: Emotion recognition is a very challenging research field due to its complexity, as individual differences in cognitive–emotional cues are expressed in a wide variety of ways, including language, facial expressions, and speech. If we use video as the input, we can acquire a plethora of data for analyzing human emotions. In this research, we use features derived from separately pretrained self-supervised learning models to combine the text, audio (speech), and visual data modalities. The fusion of features and representations is the biggest challenge in multimodal emotion classification research. Because self-supervised learning features are high-dimensional, we present a transformer- and attention-based fusion method for incorporating multimodal self-supervised learning features that achieved an accuracy of 86.40% for multimodal emotion classification.
Keywords: self-attention transformer; multimodality; inter-modality attention transformer;
contextual emotion recognition; depth of emotional dimensionality; computer vision; real-time
application
1. Introduction
Emotion recognition and sentiment analysis have recently received a great deal of attention due to their numerous applications, such as in human–computer interaction, education, and healthcare robotics. The correlation between the information reflected in and transmitted by facial expressions and a person's contemporaneous emotional state is a hot topic in both customer service and education research. Earlier techniques for encoding data modalities relied on emotion identification features such as mel-frequency cepstral coefficients (MFCCs) [1], elements of facial muscle activity, and GloVe word embeddings [2]. Recent studies [3,4] have instead looked into transfer learning approaches that extract features from pretrained deep learning (DL) models rather than low-level features. The primary purpose of our research was to create contextualized representations from these extracted features using a transformer-based architecture, and then to use these representations to evaluate low/high degrees of arousal and valence.
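To make this fusion idea concrete, the following minimal PyTorch sketch shows one way an inter-modality attention block over pretrained SSL features could be organized; the layer sizes, the use of text features as attention queries, and the two-class output head are illustrative assumptions, not the exact architecture reported in this paper.

# Minimal illustrative sketch of inter-modality attention fusion (assumed design,
# not the authors' exact architecture). Requires PyTorch.
import torch
import torch.nn as nn

class InterModalityAttentionFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, video_dim=256,
                 d_model=256, n_heads=4, n_classes=2):
        super().__init__()
        # Project each modality's SSL features into a shared space.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        # Cross-modal attention: text tokens query the audio and video sequences.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)  # e.g., low/high arousal

    def forward(self, text_feat, audio_feat, video_feat):
        # Inputs: (batch, seq_len, feature_dim) sequences of pretrained SSL embeddings.
        q = self.text_proj(text_feat)
        kv = torch.cat([self.audio_proj(audio_feat),
                        self.video_proj(video_feat)], dim=1)
        fused, _ = self.cross_attn(q, kv, kv)   # (batch, text_len, d_model)
        pooled = fused.mean(dim=1)              # temporal average pooling
        return self.classifier(pooled)

# Toy usage with random tensors standing in for extracted SSL features.
model = InterModalityAttentionFusion()
logits = model(torch.randn(2, 12, 768),   # text  (e.g., RoBERTa)
               torch.randn(2, 50, 768),   # audio (e.g., Wav2Vec)
               torch.randn(2, 30, 256))   # video (e.g., FAb-Net)
print(logits.shape)  # torch.Size([2, 2])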
The goal of our study was to extract facial expression and acoustic features [5] from pretrained DL models for supervised learning. Previous work has combined low-level and deep features rather than representing all modalities with characteristics derived from pretrained deep learning models [6]. In contrast to earlier research, we used deep features taken from pretrained self-supervised learning models to represent all input modalities (audio, video, and text) [7–9]. RoBERTa [9], Wav2Vec [9], and FAb-Net [9] are three freely accessible pretrained self-supervised learning (SSL) embedding models that we used to represent text, speech, and facial expressions, respectively. Emotional indicators that are transferable across speakers,
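As a minimal sketch of this feature-extraction step, the snippet below shows how text and speech embeddings could be obtained from publicly available RoBERTa and Wav2Vec 2.0 checkpoints. It assumes the Hugging Face transformers and torchaudio packages, which the paper does not specify; the audio file name is hypothetical, and FAb-Net is omitted because it has no comparable hub checkpoint.

# Illustrative sketch only: extracting pretrained SSL embeddings for two modalities.
# Assumes the Hugging Face transformers and torchaudio packages.
import torch
import torchaudio
from transformers import (RobertaTokenizer, RobertaModel,
                          Wav2Vec2FeatureExtractor, Wav2Vec2Model)

# Text features from pretrained RoBERTa.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base").eval()
tokens = tokenizer("I am so happy to see you!", return_tensors="pt")
with torch.no_grad():
    text_feat = roberta(**tokens).last_hidden_state        # (1, n_tokens, 768)

# Speech features from pretrained Wav2Vec 2.0.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
waveform, sr = torchaudio.load("utterance.wav")             # hypothetical file
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)
audio_in = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feat = wav2vec(**audio_in).last_hidden_state      # (1, n_frames, 768)

print(text_feat.shape, audio_feat.shape)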