NEURAL ARCHITECTURE OF SPEECH

Subba Reddy Oota 1,2, Khushbu Pahwa 3, Mounika Marreddy 2, Manish Gupta 2,4, Bapi S. Raju 2
1 Inria Bordeaux, France, 2 IIIT Hyderabad, 3 University of California, LA, 4 Microsoft

ABSTRACT

A vast literature on brain encoding has effectively harnessed deep neural network models for accurately predicting brain activations from visual or text stimuli. Unfortunately, there is not much work on brain encoding for speech stimuli. The few existing studies on brain encoding for speech stimuli transcribe speech to text and then leverage text-only models for encoding, thereby ignoring the audio signal completely. Recently, however, several speech representation learning models have revolutionized the field of speech processing. Inspired by this progress, we present a first systematic study of human speech processing that probes neural speech models to predict activations in both language and auditory brain regions. In particular, we investigate 30 speech representation models grouped into four categories: (i) traditional feature engineering, (ii) generative, (iii) predictive, and (iv) contrastive, to study how these models encode speech stimuli and align with human brain activity on the Moth Radio Hour fMRI (functional magnetic resonance imaging) dataset. We find that both contrastive (Wav2Vec2.0) and predictive (HuBERT, Data2Vec) models are very accurate. Specifically, Data2Vec aligns best with both language and auditory brain regions among all investigated models. We make our code publicly available 1.

Index Terms— brain encoding for speech, deep learning speech models, Data2Vec

I. INTRODUCTION

In computational cognitive science, brain encoding is the problem of predicting brain activations from stimuli [1]. For the past two decades, researchers have focused on mapping stimulus representations to brain activations through encoding models for text and vision.
For text, researchers have explored both syntactic and semantic representations, the most recent ones using Transformer-based deep learning (DL) methods [2]–[4]. For vision, encoding models have leveraged our algorithmic understanding of the visual hierarchy (V1, V2, V4, IT) in the visual cortex and hence follow a convolutional neural network-based design [5], [6]. Recently, several deep learning models have been shown to be very effective for speech-processing tasks such as speech recognition, speech synthesis, and speaker recognition [7].

Fig. 1. Broad architecture of studying alignment of neural encoding models for speech with fMRI brain activations. [Figure: speech stimulus → speech models (generative, predictive, contrastive) → speech stimulus representation (X) → ridge regression (W), evaluated against fMRI brain activity recordings (Y) via Pearson correlation, PCC = corr(Y, XW).]

While there has been a rigorous evaluation of various deep learning-based text [8] and vision [9] stimulus representations for brain encoding, speech stimuli have mostly been represented using encodings of text transcriptions [10] or using basic features like phoneme rate, the sum of squared FFT coefficients [11], etc. Text transcription-based methods ignore the raw audio-sensory information completely, and basic speech feature engineering misses the benefits of transfer learning from rigorously pretrained speech DL models. Only recently has there been some work on encoding speech stimuli using speech DL models [12]–[15], but each of these studies experiments with only one or a handful of models. In this work, we experiment with 30 speech models grouped into four types. The DL models are used in probe mode (frozen, as feature extractors) to encode the speech stimuli. The encodings are then used to train ridge regression models, as is customary in the brain encoding literature.

1 https://github.com/kpahwa16/Neural-Architecture-of-Speech
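The encoding pipeline sketched in Fig. 1 can be written down compactly. Below is a minimal, self-contained sketch using synthetic data: it assumes the speech representations X and fMRI responses Y have already been extracted and time-aligned, and all names and dimensions (feature size, voxel count, alpha) are illustrative choices, not the paper's actual configuration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 400, 100, 100, 50  # hypothetical sizes

# Synthetic stand-ins: X = speech stimulus representations,
# Y = fMRI responses generated from a hidden linear map plus noise.
X_train = rng.standard_normal((n_train, n_feat))
W_true = rng.standard_normal((n_feat, n_vox))
Y_train = X_train @ W_true + rng.standard_normal((n_train, n_vox))

X_test = rng.standard_normal((n_test, n_feat))
Y_test = X_test @ W_true + rng.standard_normal((n_test, n_vox))

# One ridge regression mapping representations to all voxels at once.
model = Ridge(alpha=1.0).fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# Per-voxel Pearson correlation between recorded and predicted activity,
# i.e. PCC = corr(Y, XW) from Fig. 1, computed on held-out data.
pcc = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                for v in range(n_vox)])
print(pcc.mean())
```

With real data, X would come from a frozen speech model's hidden states and the ridge penalty alpha would typically be tuned per voxel via cross-validation.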
We evaluate accuracy using the Pearson correlation coefficient (PCC) on Moth Radio Hour [10], a BOLD (blood-oxygen-level-dependent) fMRI (functional magnetic resonance imaging) dataset. Overall, we make the following contributions. (1) To the best of our knowledge, we are the first to perform an extensive study of brain encoding using DL-based speech models. (2) We evaluate 30 speech models grouped into four types against a popular BOLD fMRI dataset. (3) Both contrastive and predictive self-supervised models outperform the other types. (4) Data2Vec [16] achieves a PCC of 0.268 for auditory areas and 0.176 for language areas, significantly better than Wav2Vec2.0 [17], thus establishing a new state-of-the-art.

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096248
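ROI-level numbers like the auditory and language PCCs quoted above are obtained by averaging per-voxel PCCs within each region. A small illustrative sketch, where the per-voxel PCC values and the ROI voxel masks are made up for demonstration (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_vox = 1000
pcc = rng.uniform(-0.1, 0.4, size=n_vox)  # hypothetical per-voxel PCCs

# Hypothetical boolean ROI masks assigning voxels to brain regions.
auditory_mask = np.zeros(n_vox, dtype=bool)
language_mask = np.zeros(n_vox, dtype=bool)
auditory_mask[:200] = True
language_mask[200:500] = True

# ROI score = mean PCC over the voxels belonging to that region.
roi_scores = {
    "auditory": float(pcc[auditory_mask].mean()),
    "language": float(pcc[language_mask].mean()),
}
print(roi_scores)
```

Comparing two models then amounts to comparing their `roi_scores` per region, with a significance test over voxels or subjects.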