(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 12, 2021

Micro Expression Recognition: Multi-scale Approach to Automatic Emotion Recognition by using Spatial Pyramid Pooling Module

Lim Jun Sian, Marzuraikah Mohd Stofa, Koo Sie Min, Mohd Asyraf Zulkifley
Department of Electrical, Electronic and Systems Engineering, Universiti Kebangsaan Malaysia, Bangi, Malaysia

Abstract—Facial expression is one of the most obvious cues that humans use to express their emotions. It is a necessary aspect of social communication between humans in their daily lives. However, humans do hide their real emotions in certain circumstances. Therefore, facial micro-expressions have been observed and analyzed to reveal true human emotions. However, a micro-expression is a complicated type of signal that manifests only briefly. Hence, machine learning techniques have been used to perform micro-expression recognition. This paper introduces a compact deep learning architecture that classifies human emotions into three categories: positive, negative, and surprise. This study utilizes the deep learning approach so that optimal features of interest can be extracted even with a limited number of training samples. To further improve the recognition performance, a multi-scale module based on the spatial pyramid pooling network is embedded into the compact network to capture facial expressions of various sizes. The base model is derived from the VGG-M model and is validated on the combined CASMEII, SMIC, and SAMM datasets. Moreover, various configurations of the spatial pyramid pooling layer were analyzed to find the optimal network setting for the micro-expression recognition task. The experimental results show that adding the multi-scale module increases the recognition performance.
The best network configuration from the experiment is composed of five parallel network branches that are placed after the second layer of the base model, with pooling kernel sizes of two, three, four, five, and six.

Keywords—Micro expression recognition; facial expression; spatial pyramid pooling module; multi-scale approach; deep learning

I. INTRODUCTION

According to [1], [2], faces are the main human “tools” for conveying emotional information. Facial expression is an important means that enables humans to interact socially with each other. This is because 55% of human feelings are manifested through facial expression. For example, an observer can deduce that someone is feeling disgusted if his/her upper lip is raised. Facial expression can be broken down into two categories, macro-expression and micro-expression. A macro-expression is an intentional facial expression, while a micro-expression is an unintentional one. Benjamin et al. [3] found that the major differences between them are the intensity and the time taken to manifest the expression. Deng et al. [4] reported that both expressions are widely used as input to various applications, the most obvious of which is estimating hidden emotions.

A micro-expression (ME) is an unintentional, quick facial movement that is primarily used to express the emotions of happiness, sadness, and surprise [5]. An ME occurs over a short time, usually in the range of 0.04 s to 0.2 s. Hence, it is hard for a human to detect the occurrence of an ME with the naked eye. Even when humans are trained to detect MEs, their average performance is only slightly better than that of people who have not undergone the training. Hence, as Zhao and Li [6] showed, machine learning has been proposed to aid humans in analyzing MEs to understand true human emotions.
Machine learning (ML) can be broadly classified into traditional machine learning and deep learning. Researchers in pattern recognition have frequently applied both techniques to applications such as facial expression recognition [7], human activity recognition [8], recycling systems [9], and image recognition [10]. Traditional machine learning relies on a set of handcrafted features, which are then passed to a decision-making algorithm such as a decision tree, a neural network, or a Support Vector Machine (SVM) [11], [12]. However, it is time-consuming for a computer vision engineer to judge which features best describe the emotions. Deep learning differs from the traditional machine learning approach in that the features of interest are obtained through iterative training, for example by learning convolution filters [10], [13]. Usually, after the feature maps have passed through a convolution process, they undergo a pooling process. Fig. 1 shows the generalized framework of traditional machine learning and deep learning algorithms for human emotion recognition tasks.
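
The convolution-then-pooling step described above can be illustrated with a minimal Python sketch that uses only the standard library. It is not the network used in this paper: the 14 × 14 toy image, the 3 × 3 edge kernel, the function names (`conv2d_valid`, `relu`, `max_pool`, `multi_scale_pool`), and the use of non-overlapping pooling windows (stride equal to kernel size) are all assumptions made for illustration. Only the pooling kernel sizes of two to six are taken from the abstract.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the image (no padding, stride 1).

    As in most deep learning libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap: Matrix) -> Matrix:
    """Set negative activations to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap: Matrix, size: int) -> Matrix:
    """Non-overlapping max pooling (assumed stride == size).

    Incomplete windows at the right and bottom borders are dropped.
    """
    out_h = len(fmap) // size
    out_w = len(fmap[0]) // size
    return [
        [
            max(fmap[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def multi_scale_pool(fmap: Matrix,
                     sizes: Sequence[int] = (2, 3, 4, 5, 6)) -> Dict[int, Matrix]:
    """Pool the same feature map in parallel, once per kernel size."""
    return {s: max_pool(fmap, s) for s in sizes}


# Hypothetical 14 x 14 toy "image" and a hand-written vertical-edge kernel.
image = [[float((r * 3 + c * 5) % 11) for c in range(14)] for r in range(14)]
kernel = [[1.0, 0.0, -1.0],
          [1.0, 0.0, -1.0],
          [1.0, 0.0, -1.0]]

feature_map = relu(conv2d_valid(image, kernel))  # 12 x 12 feature map
pooled = multi_scale_pool(feature_map)

for size, branch in pooled.items():
    print(f"kernel {size}: {len(branch)} x {len(branch[0])}")
```

With a 12 × 12 feature map, the five branches yield 6 × 6, 4 × 4, 3 × 3, 2 × 2, and 2 × 2 outputs, so each branch summarizes the same features at a different spatial scale. In a trained network the kernel weights would be learned rather than written by hand, and the branch outputs would be combined before classification.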