(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 12, 2022

Emotion Recognition on Multimodal with Deep Learning and Ensemble

David Adi Dharma, Amalia Zahra
Computer Science Department, BINUS Graduate Program - Master of Computer Science
Bina Nusantara University, Jakarta, Indonesia 11480

Abstract—Emotion recognition on multimodal datasets is a difficult task and one of the most important in fields such as Human-Computer Interaction (HCI). This paper presents a multimodal approach for emotion recognition on the MELD dataset. The dataset contains three modalities: audio, text, and facial features. In this research, only the audio and text modalities are experimented on. For audio data, the raw audio is converted into MFCCs, which serve as input to a bidirectional LSTM built to perform emotion classification. For text data, BERT is used to tokenize the utterances as input to the text model, and a bidirectional LSTM is built to classify the emotion. Finally, a voting ensemble method is implemented to combine the results from the two modalities. The models are evaluated using the F1-score and confusion matrices. The unimodal audio model achieved an F1-score of 41.69%, the unimodal text model achieved an F1-score of 47.29%, and the voting ensemble model achieved an F1-score of 47.47%. To conclude, this paper also discusses future work on building and improving deep learning models and combining them with ensemble methods for better performance on emotion recognition tasks over multimodal datasets.

Keywords—Emotion recognition; deep learning; ensemble method; transformer; natural language processing

I. INTRODUCTION

Along with technological developments, humans pour their emotions or feelings into media such as text, photos, audio, or video recordings.
Human emotions are complex, which makes them difficult to study or predict, and it takes a high level of intelligence to recognize the emotions expressed by people in these media [1]. Due to the complexity and variety of human emotions, and the range of media through which they are conveyed, AI models have evolved toward multimodal datasets, where combining media such as audio, video, text, and biological signals can improve a model's ability to classify emotions more accurately [2]. Emotion recognition involves several fields of study, including Natural Language Processing (NLP) and Machine Learning (ML).

The algorithms used to classify emotions are also being actively researched and developed. The same classification algorithm can produce different results depending on the dataset used. The datasets used for classification include media such as text, images, EEG (electroencephalogram) signals, and sound. Examples of ML methods used for classifying emotions include SVM (Support Vector Machine), KNN (K-Nearest Neighbor), and Bayesian Networks, as in the research of [3]. Emotion classification methods then developed toward neural-network-based Deep Learning (DL) models, such as the Recurrent Neural Network (RNN) in the research of [4], the Deep Neural Network (DNN) in [5], DialogueRNN, an RNN-based model, in the research by [6], and the LSTM (Long Short-Term Memory) and CNN (Convolutional Neural Network) in the experiment of [7]. Lastly, a bidirectional LSTM-CNN model was used to learn context and classify emotions in the works of [8] and [9]. The Transformer, proposed in 2017 [10], is a neural network architecture that outperformed the RNN and LSTM models then popular in NLP. The Transformer was later developed into BERT in 2019 [11].
BERT stands for Bidirectional Encoder Representations from Transformers, and to this day BERT has been widely used to extract contextual information from text. BERT has also been used for emotion recognition on text datasets, as in the research of [11] published in 2022.

Ensemble learning [12] is a technique that combines base predictor models to produce a stronger predictor. There are several kinds of ensemble learning, known as bagging, boosting, and stacking [13]. In [14], there is an ensemble method called voting, which chooses the class with the highest probability value as the final classification result. In this experiment, the voting ensemble method is performed.

There are several datasets that are widely used in multimodal emotion classification research, such as IEMOCAP (The Interactive Emotional Dyadic Motion Capture) and MELD (Multimodal EmotionLines Dataset). MELD was created and published in 2018 by [15]. These works have used IEMOCAP and MELD for building and developing models that classify emotions in multimodal data. This paper builds a model using the MELD dataset, which contains seven emotion labels: anger, disgust, sadness, joy, neutral, surprise, and fear. All data in the MELD dataset are in English. The dataset provides three modalities: audio and facial data from video, and textual data. The purposes of the experiment are to extract emotion features from a multimodal dataset and to build a model that recognizes the emotions learned from those features. This experiment also aims to evaluate the model built and compare the evaluation results with existing models.
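To illustrate the voting scheme described above, the following minimal sketch averages the per-class probabilities produced by the two unimodal models and selects the label with the highest combined score (a common soft-voting variant; the paper does not specify its exact combination rule). The function name `soft_vote` and the probability vectors are illustrative assumptions, not outputs of the paper's models; only the seven-label set follows MELD.

```python
# Soft-voting ensemble sketch: average per-class probabilities from the
# audio model and the text model, then pick the label with the highest
# combined score. The probability vectors below are made up for illustration.

EMOTIONS = ["anger", "disgust", "sadness", "joy", "neutral", "surprise", "fear"]

def soft_vote(audio_probs, text_probs):
    """Average the two modalities' class probabilities and return the winner."""
    combined = [(a + t) / 2.0 for a, t in zip(audio_probs, text_probs)]
    best_index = max(range(len(combined)), key=lambda i: combined[i])
    return EMOTIONS[best_index]

# Hypothetical outputs for one utterance: the audio model leans toward
# "neutral" while the text model is confident in "joy"; averaging the
# scores lets the more confident text prediction win.
audio_probs = [0.05, 0.02, 0.08, 0.25, 0.40, 0.15, 0.05]
text_probs  = [0.03, 0.01, 0.04, 0.60, 0.20, 0.10, 0.02]

print(soft_vote(audio_probs, text_probs))  # prints "joy"
```

In this sketch, disagreement between modalities is resolved by confidence: "joy" averages to 0.425 against 0.30 for "neutral", so the ensemble outputs "joy".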