Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech

Gabrielle K. Liu
Ravenwood High School
Brentwood, TN 37027
gkml@mit.edu

Abstract

Current approaches to speech emotion recognition focus on speech features that can capture the emotional content of a speech signal. Mel Frequency Cepstral Coefficients (MFCCs) are one of the most commonly used representations for audio speech recognition and classification. This paper proposes Gammatone Frequency Cepstral Coefficients (GFCCs) as a potentially better representation of speech signals for emotion recognition. The effectiveness of MFCC and GFCC representations is compared and evaluated over emotion and intensity classification tasks with fully connected and recurrent neural network architectures. The results provide evidence that GFCCs outperform MFCCs in speech emotion recognition.

1 Introduction

In recent years, human-computer interactions have become increasingly representative of realistic interpersonal interactions. AI assistants are now able to understand much of the content of human speech. However, humans also convey information through emotional cues, and current AI technologies are unable to engage in emotional communication. This gap, and the potential benefits arising from emotion artificial intelligence, has prompted growing interest in speech emotion recognition.

Emotion recognition is usually modeled as a classification task. Accordingly, an important question lies in how to best represent a speech signal and capture its emotional content—that is, which features of speech should be used to generate a speech representation? Mel Frequency Cepstral Coefficients (MFCCs) are one of the most commonly used speech features for speech recognition. Existing studies have found that MFCCs outperform other commonly used speech features (e.g. loudness, formants, linear predictive coefficients) [VM16; JSDN14; KJPS12].
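To make the mel-based representation concrete: MFCC pipelines place triangular filters at centre frequencies spaced uniformly on the mel scale, a perceptual mapping of frequency. A minimal sketch of that mapping and the resulting centre-frequency spacing (the specific sample rate and filter count below are illustrative assumptions, not values from this paper):

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping: mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies for a bank of 26 filters spanning 0-8000 Hz,
# spaced uniformly in mel (hence nonlinearly in Hz).
low_hz, high_hz, n_filters = 0.0, 8000.0, 26
mel_lo, mel_hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
mel_points = [mel_lo + i * (mel_hi - mel_lo) / (n_filters + 1)
              for i in range(n_filters + 2)]
centre_freqs_hz = [mel_to_hz(m) for m in mel_points]
```

The nonlinear spacing (dense at low frequencies, sparse at high frequencies) is what lets MFCCs approximate human pitch perception; the cepstral coefficients themselves are then obtained by a DCT over the log filter-bank energies.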
While MFCCs have gained attention in recent years in the context of speech emotion recognition, Gammatone Frequency Cepstral Coefficients (GFCCs) have remained underappreciated. GFCCs are sometimes used in speech and speaker recognition systems [Bur14; XSZ16; SW08]. In contrast to MFCCs, which are based upon the Mel Filter Bank, GFCCs are based upon the Gammatone Filter Bank, whose filters model physiological processing in the inner ear and external middle ear [KDL07]. Compared to MFCCs, they are more robust against noise and are often used in speaker identification systems [ZW13; Jee+17; RSS17].

In this study, we propose that GFCCs are superior to MFCCs for speech emotion recognition. Specifically, we seek to evaluate GFCC representations of speech versus MFCC representations for the tasks of speech emotion and intensity classification. In section 2, we describe the experimental setup. In section 3, we discuss results and conclude by outlining future work.

Preprint. Work in progress.
arXiv:1806.09010v1 [cs.SD] 23 Jun 2018
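For intuition on the gammatone front end: each channel of the Gammatone Filter Bank is a bandpass filter whose impulse response is a gamma-distribution envelope modulating a tone at the channel's centre frequency, with bandwidth tied to the equivalent rectangular bandwidth (ERB) of human auditory filters. A minimal time-domain sketch (the sample rate, duration, and 4th-order choice below are common conventions assumed for illustration):

```python
import math

def erb(f_hz: float) -> float:
    """Equivalent rectangular bandwidth of an auditory filter at f_hz
    (standard Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(fc_hz: float, sr: int = 16000,
                 duration_s: float = 0.05, order: int = 4) -> list:
    """Impulse response of an order-N gammatone filter centred at fc_hz:
    g(t) = t^(N-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    b = 1.019 * erb(fc_hz)  # bandwidth parameter for the gammatone fit
    ir = []
    for i in range(int(sr * duration_s)):
        t = i / sr
        ir.append(t ** (order - 1)
                  * math.exp(-2.0 * math.pi * b * t)
                  * math.cos(2.0 * math.pi * fc_hz * t))
    return ir

response = gammatone_ir(1000.0)  # 1 kHz channel
```

A GFCC pipeline then rectifies and compresses the filter-bank outputs (often with a cube root rather than a log) and applies a DCT, mirroring the MFCC recipe with the gammatone bank substituted for the mel bank.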