A Review on Methods and Applications in Multimodal Deep Learning
SUMMAIRA JABEEN and XI LI∗, College of Computer Science, Zhejiang University, China
AMIN MUHAMMAD SHOIB, School of Software Engineering, East China Normal University, China
OMAR BOURAHLA, SONGYUAN LI, and ABDUL JABBAR, College of Computer Science, Zhejiang University, China
Deep learning has enabled a wide range of applications and has become increasingly popular in recent years. The goal of
multimodal deep learning (MMDL) is to create models that can process and link information from various modalities. Despite
the extensive development of unimodal learning, it still cannot cover all the aspects of human learning. Multimodal
learning helps to understand and analyze information better when various senses are engaged in its processing. This
paper focuses on multiple types of modalities, i.e., image, video, text, audio, body gestures, facial expressions, physiological
signals, flow, RGB, pose, depth, mesh, and point cloud. A detailed analysis of the baseline approaches and an in-depth study of
recent advancements during the last five years (2017 to 2021) in multimodal deep learning applications is provided.
A fine-grained taxonomy of various multimodal deep learning methods is proposed, elaborating on different applications
in more depth. Lastly, the main issues are highlighted separately for each domain, along with their possible future research
directions.
CCS Concepts: • Computing methodologies → Machine learning; • Information systems → Multimedia and
multimodal retrieval.
Additional Key Words and Phrases: Deep Learning, Multimedia, Multimodal learning, datasets, Neural Networks, Survey
1 Introduction
Multimodal learning proposes that we are able to remember and understand more when engaging multiple senses
during the learning process. MMDL technically involves different aspects and challenges, such as representation,
translation, alignment, fusion, and co-learning, when learning from two or more modalities [1, 2]. Information
from multiple sources is contextually related and occasionally provides additional necessary information to one
another, revealing features that would not be visible when working with individual modalities. MMDL models
combine heterogeneous data from multiple sources, allowing for more accurate predictions [3]. Extracting
and presenting relevant information from multimodal data remains an inspirational motive for MMDL research.
Merging various modalities to optimize effectiveness is still an appealing challenge. Furthermore, the accuracy
and flexibility of multimodal systems are not optimal due to the insufficiency of labeled data.
The recent advances and trends of MMDL range from audio-visual speech recognition (AVSR) [4, 5] and multimedia
content indexing and retrieval [6–9] to understanding human multimodal behaviors during social interaction,
∗Corresponding Author: xilizju@zju.edu.cn
Authors’ addresses: Summaira Jabeen, 11821129@zju.edu.cn; Xi Li, xilizju@zju.edu.cn, College of Computer Science, Zhejiang University,
China, P.O. Box W-99, Hangzhou, 310027; Amin Muhammad Shoib, School of Software Engineering, East China Normal University, 3663
North Zhongshan Road., Shanghai, China, 52184501030@stu.ecnu.edu.cn; Omar Bourahla, bourahla@zju.edu.cn; Songyuan Li, leizungjyun@
zju.edu.cn; Abdul Jabbar, Jabbar@zju.edu.cn, College of Computer Science, Zhejiang University, China, P.O. Box W-99, Hangzhou, 310027.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first
page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
permissions@acm.org.
© 2022 Association for Computing Machinery.
1551-6857/2022/10-ART $15.00
https://doi.org/10.1145/3545572
ACM Trans. Multimedia Comput. Commun. Appl.