Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech

Gabrielle K. Liu
Ravenwood High School
Brentwood, TN 37027
gkml@mit.edu

Abstract

Current approaches to speech emotion recognition focus on speech features that can capture the emotional content of a speech signal. Mel Frequency Cepstral Coefficients (MFCCs) are one of the most commonly used representations for audio speech recognition and classification. This paper proposes Gammatone Frequency Cepstral Coefficients (GFCCs) as a potentially better representation of speech signals for emotion recognition. The effectiveness of MFCC and GFCC representations is compared and evaluated over emotion and intensity classification tasks with fully connected and recurrent neural network architectures. The results provide evidence that GFCCs outperform MFCCs in speech emotion recognition.

1 Introduction

In recent years, human-computer interactions have become increasingly representative of realistic interpersonal interactions. AI assistants are now able to understand much of the content of human speech. However, humans also convey information through emotional cues, and current AI technologies are unable to engage in emotional communication. This gap, and the potential benefits arising from emotion artificial intelligence, has prompted growing interest in speech emotion recognition.

Emotion recognition is usually modeled as a classification task. Accordingly, an important question lies in how to best represent a speech signal and capture its emotional content—that is, which features of speech should be used to generate a speech representation? Mel Frequency Cepstral Coefficients (MFCCs) are one of the most commonly used speech features for speech recognition. Existing studies have found that MFCCs outperform other commonly used speech features (e.g. loudness, formants, linear predictive coefficients) [VM16; JSDN14; KJPS12].
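To make the mel-based representation concrete: MFCC pipelines place triangular filters at centre frequencies spaced uniformly on the mel scale, a perceptual mapping of frequency. A minimal sketch of that mapping and the resulting centre-frequency spacing (the specific sample rate and filter count below are illustrative assumptions, not values from this paper):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies for a bank of 26 filters spanning 0-8000 Hz,
# spaced uniformly in mel (hence nonlinearly in Hz).
low_hz, high_hz, n_filters = 0.0, 8000.0, 26
mel_lo, mel_hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
mel_points = [mel_lo + i * (mel_hi - mel_lo) / (n_filters + 1)
              for i in range(n_filters + 2)]
centre_freqs_hz = [mel_to_hz(m) for m in mel_points]
```

The nonlinear spacing (dense at low frequencies, sparse at high frequencies) is what lets MFCCs approximate human pitch perception; the cepstral coefficients themselves are then obtained by a DCT over the log filter-bank energies.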
While MFCCs have gained attention in recent years in the context of speech emotion recognition, Gammatone Frequency Cepstral Coefficients (GFCCs) have remained underappreciated. GFCCs are sometimes used in speech and speaker recognition systems [Bur14; XSZ16; SW08]. In contrast to MFCCs, which are based upon the Mel Filter Bank, GFCCs are based upon the Gammatone Filter Bank, whose filters model physiological processing in the inner ear and external middle ear [KDL07]. Compared to MFCCs, they are more robust against noise and are often used in speaker identification systems [ZW13; Jee+17; RSS17].

In this study, we propose that GFCCs are superior to MFCCs for speech emotion recognition. Specifically, we seek to evaluate GFCC representations of speech versus MFCC representations for the tasks of speech emotion and intensity classification. In section 2, we describe the experimental setup. In section 3, we discuss results and conclude by outlining future work.

Preprint. Work in progress.
arXiv:1806.09010v1 [cs.SD] 23 Jun 2018
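For intuition on the gammatone front end: each channel of the Gammatone Filter Bank is a bandpass filter whose impulse response is a gamma-distribution envelope modulating a tone at the channel's centre frequency, with bandwidth tied to the equivalent rectangular bandwidth (ERB) of human auditory filters. A minimal time-domain sketch (the sample rate, duration, and 4th-order choice below are common conventions assumed for illustration):

```python
import math

def erb(f_hz: float) -> float:
    """Equivalent rectangular bandwidth of an auditory filter at f_hz
    (standard Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz: float, sr: int = 16000,
                 duration_s: float = 0.05, order: int = 4) -> list:
    """Impulse response of an order-N gammatone filter centred at fc_hz:
    g(t) = t^(N-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    b = 1.019 * erb(fc_hz)  # bandwidth parameter for the gammatone fit
    ir = []
    for i in range(int(sr * duration_s)):
        t = i / sr
        ir.append(t ** (order - 1)
                  * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc_hz * t))
    return ir

response = gammatone_ir(1000.0)  # 1 kHz channel
```

A GFCC pipeline then rectifies and compresses the filter-bank outputs (often with a cube root rather than a log) and applies a DCT, mirroring the MFCC recipe with the gammatone bank substituted for the mel bank.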