A Review on Methods and Applications in Multimodal Deep Learning
SUMMAIRA JABEEN and XI LI∗, College of Computer Science, Zhejiang University, China
AMIN MUHAMMAD SHOIB, School of Software Engineering, East China Normal University, China
OMAR BOURAHLA, SONGYUAN LI, and ABDUL JABBAR, College of Computer Science, Zhejiang University, China
Deep learning has enabled a wide range of applications and has become increasingly popular in recent years. The goal of
multimodal deep learning (MMDL) is to create models that can process and link information from various modalities. Despite
the extensive development of unimodal learning, it still cannot cover all the aspects of human learning. Multimodal
learning helps to understand and analyze information better when various senses are engaged in its processing. This
paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological
signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of
recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications is provided.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications
in more depth. Lastly, the main issues are highlighted separately for each domain, along with their possible future research
directions.
CCS Concepts: • Computing methodologies → Machine learning; • Information systems → Multimedia and
multimodal retrieval.
Additional Key Words and Phrases: Deep Learning, Multimedia, Multimodal learning, datasets, Neural Networks, Survey
1 Introduction
Multimodal learning proposes that we are able to remember and understand more when engaging multiple senses
during the learning process. MMDL technically involves different aspects and challenges, such as representation,
translation, alignment, fusion, and co-learning, when learning from two or more modalities [1, 2]. Information
from multiple sources is contextually related and occasionally provides additional necessary information to one
another, revealing features that would not be visible when working with individual modalities. MMDL models
combine heterogeneous data from multiple sources, allowing for more accurate predictions [3]. Extracting
and presenting relevant information from multimodal data remains an inspirational motive for MMDL research.
Merging various modalities to optimize effectiveness is still an appealing challenge. Furthermore, the accuracy
and flexibility of multimodal systems are not optimal due to the insufficiency of labeled data.
The recent advances and trends of MMDL range from audio-visual speech recognition (AVSR) [4, 5] and multimedia
content indexing and retrieval [6–9] to understanding human multimodal behaviors during social interaction,
∗Corresponding Author: xilizju@zju.edu.cn
Authors’ addresses: Summaira Jabeen, 11821129@zju.edu.cn; Xi Li, xilizju@zju.edu.cn, College of Computer Science, Zhejiang University,
China, P.O. Box W-99, Hangzhou, 310027; Amin Muhammad Shoib, School of Software Engineering, East China Normal University, 3663
North Zhongshan Road., Shanghai, China, 52184501030@stu.ecnu.edu.cn; Omar Bourahla, bourahla@zju.edu.cn; Songyuan Li, leizungjyun@
zju.edu.cn; Abdul Jabbar, Jabbar@zju.edu.cn, College of Computer Science, Zhejiang University, China, P.O. Box W-99, Hangzhou, 310027.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
© 2022 Association for Computing Machinery.
1551-6857/2022/10-ART $15.00
https://doi.org/10.1145/3545572
ACM Trans. Multimedia Comput. Commun. Appl.