Received: June 15, 2020. Revised: July 2, 2020.
International Journal of Intelligent Engineering and Systems, Vol. 13, No. 5, 2020. DOI: 10.22266/ijies2020.1031.23

Speech Emotion Recognition Using MELBP Variants of Spectrogram Image

Suhaila N. Mohammed 1,2*   Alia K. Abdul Hassan 1
1 Computer Sciences Department, University of Technology, Baghdad, Iraq
2 Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq
* Corresponding author's Email: suhailan.mo@sc.uobaghdad.edu.iq

Abstract: Speech emotion recognition has many applications in daily life, such as conversational agents, human-robot interaction, and call centres. However, recognizing emotion from a speech signal is not trivial, owing to the difficulty of determining a feature set that captures the emotion conveyed in the signal accurately. This paper exploits image processing techniques to address the speech emotion recognition problem. After the signal is converted into a 2D spectrogram image, four forms of Extended Local Binary Pattern (ELBP) are generated to serve as the source for the feature extraction stage. Histograms of multiple blocks of the ELBP variants are computed and fed to a Deep Belief Network (DBN) for classification. Tests on the Surrey Audio-Visual Expressed Emotion (SAVEE) database show that combined MELBP vectors give the best accuracy, 72.14%, which outperforms state-of-the-art results on the same database.

Keywords: Speech emotion, Spectrogram image, Multi-block extended local binary pattern (MELBP), Deep belief network (DBN), Short-time Fourier transform (STFT).

1. Introduction

Speech Emotion Recognition (SER) refers to the extraction of feelings from speech signals. Applications that depend on the user's emotional state, such as human-robot interaction, pain and lie detection, computer-based tutoring systems, and movie or music recommendation systems, can benefit from SER [1, 2]. In general, a SER system uses a classifier to recognize the emotion from a feature vector extracted from the speech signal. A SER system must be robust to speaking rate and speaker style; that is, speaker characteristics such as differences in age, gender, and culture should not affect its performance. Many researchers work in this field to give machines the intelligence to understand a user's emotional state from the speech signal.

In SER, the investigation and extraction of relevant, discriminative features is a difficult task [3]. Speech features can be extracted in several ways, including continuous-based features, spectral-based features, and digital image processing techniques. Many researchers believe that effective continuous features such as pitch and energy reflect most of an utterance's emotional content; for example, the speaker's arousal state influences the overall energy and the length and duration of speech pauses. Well-known continuous acoustic features include energy-related features, pitch, formants, and timing-related features. Pitch is a fundamental property of the speech signal; it describes the highness or lowness of tone in speech. Pitch values increase with high-arousal emotions such as happiness and surprise, and decrease with low-arousal emotions such as sadness and fear.
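To make the pitch behaviour concrete, the sketch below estimates the fundamental frequency (F0) of a single voiced frame with a basic autocorrelation method. This is a common textbook estimator, not the feature extractor used in this paper; the sampling rate, frame length, and 75-400 Hz search band are illustrative assumptions for adult speech.

```python
# Illustrative autocorrelation-based F0 (pitch) estimator for one voiced
# frame; a textbook method, not this paper's feature set.
import numpy as np

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    frame = frame - np.mean(frame)                   # remove DC offset
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)          # plausible pitch lags
    lag = lo + int(np.argmax(ac[lo:hi]))             # strongest periodicity
    return fs / lag                                  # fundamental freq. in Hz

# Usage: a 200 Hz tone in a 30 ms frame at 16 kHz should yield ~200 Hz.
t = np.arange(0, 0.03, 1 / 16000)
print(estimate_f0(np.sin(2 * np.pi * 200 * t), 16000))
```

Under this scheme, a frame from a happy utterance would typically yield a higher F0 estimate than one from a sad utterance, in line with the arousal effect described above.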
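The digital image processing route mentioned above, which this paper adopts, instead converts the signal into a spectrogram image and describes its texture with block-wise binary-pattern histograms. The following is a minimal sketch of such a front-end, assuming plain uniform LBP from scikit-image as a stand-in for the ELBP/MELBP variants developed later in the paper; the STFT window, block grid, and LBP parameters are illustrative, not the paper's settings.

```python
# Minimal sketch: waveform -> spectrogram image -> block-wise LBP histograms.
# Uniform LBP stands in for the paper's ELBP/MELBP variants.
import numpy as np
from scipy.signal import stft
from skimage.feature import local_binary_pattern

def spectrogram_image(signal, fs, nperseg=512, noverlap=256):
    """STFT magnitude in dB, rescaled to an 8-bit grayscale image."""
    _, _, Z = stft(signal, fs=fs, nperseg=nperseg, noverlap=noverlap)
    mag_db = 20.0 * np.log10(np.abs(Z) + 1e-10)
    img = (mag_db - mag_db.min()) / (np.ptp(mag_db) + 1e-10)
    return (img * 255).astype(np.uint8)

def multiblock_lbp_histogram(img, grid=(4, 4), P=8, R=1):
    """Concatenated, normalised histograms of uniform LBP codes per block."""
    lbp = local_binary_pattern(img, P, R, method="uniform")
    n_bins = P + 2  # P+1 uniform codes plus one bin for non-uniform patterns
    feats = []
    for row in np.array_split(lbp, grid[0], axis=0):
        for block in np.array_split(row, grid[1], axis=1):
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(block.size, 1))
    return np.concatenate(feats)  # feature vector fed to the classifier
```

Concatenating per-block histograms, rather than computing one global histogram, preserves the coarse time-frequency layout of the spectrogram, which is the rationale behind the multi-block scheme.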
In phonetics, a formant is essentially an acoustic resonance of the human vocal tract [4]. Formants can be extracted by finding the amplitude peaks in the frequency spectrum of the speech signal. Timing-related features provide information about the distribution of duration-related parameters such as speech rate, the