NEURAL ARCHITECTURE OF SPEECH
Subba Reddy Oota (1,2), Khushbu Pahwa (3), Mounika Marreddy (2), Manish Gupta (2,4), Bapi S. Raju (2)
(1) Inria Bordeaux, France, (2) IIIT Hyderabad, (3) University of California, LA, (4) Microsoft
ABSTRACT
A vast literature on brain encoding has effectively harnessed
deep neural network models for accurately predicting brain
activations from visual or text stimuli. Unfortunately, there
is not much work on brain encoding for speech stimuli.
The few existing studies on brain encoding for speech
stimuli transcribe speech to text and then leverage text-
only models for encoding, thereby ignoring audio signals
completely. Recently, however, several speech representation
learning models have revolutionized the field of speech
processing. Inspired by this progress, we present the first
systematic study of human speech processing that probes
neural speech models to predict activations in both language
and auditory brain regions. In particular, we investigate 30 speech
representation models grouped into four categories: (i) tradi-
tional feature engineering, (ii) generative, (iii) predictive, and
(iv) contrastive, to study how these models encode the speech
stimuli and align with human brain activity for the Moth
Radio Hour fMRI (functional magnetic resonance imaging)
dataset. We find that both contrastive (Wav2Vec2.0) and
predictive (HuBERT, Data2Vec) models yield highly accurate
encodings; among all investigated models, Data2Vec aligns
best with both language and auditory brain regions.
We make our code publicly available¹.
Index Terms—brain encoding for speech, deep learning
speech models, Data2Vec
I. INTRODUCTION
In computational cognitive science, brain encoding is the
problem of predicting brain activations from stimuli [1]. For
the past two decades, researchers have focused on mapping
stimulus representations to brain activations through encod-
ing models for text and vision. For text, researchers have
explored both syntactic as well as semantic representations,
the most recent ones using Transformer-based deep learning
(DL) methods [2]–[4]. For vision, the encoding models have
leveraged our algorithmic understanding of visual hierarchy
(V1, V2, V4, IT) in the visual cortex and hence follow the
convolutional neural network-based design [5], [6].
Recently, several deep learning models have been shown
to be very effective for speech-processing tasks like speech
recognition, speech synthesis, and speaker recognition [7].
¹ https://github.com/kpahwa16/Neural-Architecture-of-Speech
[Fig. 1 schematic: speech stimulus → speech models (contrastive, generative, predictive) → stimulus representation (X) → ridge regression (W) → predicted activity, evaluated against recorded fMRI brain activity (Y) via Pearson correlation, PCC = corr(Y, XW).]
Fig. 1. Broad architecture for studying the alignment of neural
encoding models for speech with fMRI brain activations.
While there has been a rigorous evaluation of various deep
learning-based text [8] and vision [9] stimuli representa-
tions for brain encoding, speech stimuli have mostly been
represented using encodings of text transcriptions [10] or
using basic features like phoneme rate, the sum of squared
FFT coefficients [11], etc. Text transcription-based methods
ignore the raw audio-sensory information completely. The
basic speech feature engineering method misses the benefits
of transfer learning from rigorously pretrained speech DL
models. Only recently has there been work on encoding
speech stimuli using speech DL models [12]–[15], but each
of these studies experiments with only one or a few models.
In this work, we experiment with 30 speech models grouped
into four types. The DL-based models are used in probe
mode (i.e., with frozen weights) to encode the speech
stimuli. The encodings are then used to train ridge regression
(as is customary in the brain encoding literature) models. We
evaluate encoding accuracy using the Pearson correlation
coefficient (PCC) on the Moth Radio Hour [10] BOLD
(blood-oxygen-level-dependent) fMRI dataset.
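The pipeline above can be sketched in a few lines: a stimulus-feature matrix X (standing in for a speech model's layer activations) is mapped to voxel responses Y with voxel-wise ridge regression, and held-out predictions are scored per voxel with Pearson correlation. The shapes and synthetic data below are illustrative assumptions, not the paper's actual features or fMRI recordings.

```python
# Minimal sketch of the encoding pipeline: ridge regression from
# stimulus features X to fMRI responses Y, scored with per-voxel PCC.
# All sizes and data here are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trs, n_feats, n_voxels = 600, 128, 50  # hypothetical sizes

# Stand-ins for speech-model activations (X) and BOLD responses (Y):
# a true linear voxel-feature mapping plus Gaussian noise.
X = rng.standard_normal((n_trs, n_feats))
W_true = rng.standard_normal((n_feats, n_voxels))
Y = X @ W_true + 0.5 * rng.standard_normal((n_trs, n_voxels))

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X, Y, test_size=0.2, random_state=0)

model = Ridge(alpha=1.0)  # in practice, alpha is tuned per voxel
model.fit(X_tr, Y_tr)
Y_pred = model.predict(X_te)

def voxelwise_pcc(y_true, y_pred):
    """Pearson correlation between columns of y_true and y_pred."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / (
        np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0))

pcc = voxelwise_pcc(Y_te, Y_pred)
print(f"mean PCC across voxels: {pcc.mean():.3f}")
```

In real experiments, the regularization strength is typically selected per voxel by cross-validation, and significance of the resulting PCC values is assessed against a permutation baseline.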
Overall, we make the following contributions in this paper.
• To the best of our knowledge, we are the first to perform
an extensive study for brain encoding using DL-based
speech models.
• We evaluate 30 speech models grouped into four types
against a popular BOLD fMRI dataset.
• Both contrastive and predictive self-supervised models
outperform the other types. Data2Vec [16] yields a PCC
of 0.268 for auditory areas and 0.176 for language areas,
significantly better than Wav2Vec2.0 [17], thus
establishing a new state-of-the-art.
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10096248