TRANSFORMER-BASED ESTIMATION OF SPOKEN SENTENCES USING
ELECTROCORTICOGRAPHY
Shuji Komeiji¹, Kai Shigemi¹, Takumi Mitsuhashi², Yasushi Iimura²,
Hiroharu Suzuki², Hidenori Sugano², Koichi Shinoda³, and Toshihisa Tanaka¹

¹Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology
²Department of Neurosurgery, Juntendo University School of Medicine
³Department of Computer Science, Tokyo Institute of Technology
ABSTRACT
Invasive brain–machine interfaces (BMIs) are a promising neurotechnological venture for achieving direct speech communication from a human brain, but they face many challenges. In this paper, we measured invasive electrocorticogram (ECoG) signals from seven participating epilepsy patients as they spoke sentences consisting of multiple phrases. A Transformer encoder was incorporated into a “sequence-to-sequence” model to decode spoken sentences from the ECoG. The decoding test showed that the Transformer model achieved a minimum phrase error rate (PER) of 16.4%, and the median (± standard deviation) across the seven participants was 31.3% (±10.0%). Moreover, the proposed model with the Transformer achieved significantly better decoding accuracy than a conventional long short-term memory model.
Index Terms— Electrocorticogram (ECoG), Brain–machine in-
terface (BMI), Transformer encoder, Sequence to sequence
1. INTRODUCTION
Brain–machine interfacing (BMI), which enables speech to be decoded from human thought, is expected to be used not only by aphasic patients but also as a new communication tool in the future [1].
Several techniques to achieve such a BMI, using the invasive electrocorticogram (ECoG) measured by electrodes implanted beneath the skull, are under development. The ECoG is superior to surface electroencephalography in terms of spatial and temporal resolution and signal-to-noise ratio; it is particularly suitable for analyzing brain activity related to speech in the high-gamma band [2, 3].
A variety of forms of speech decoding using ECoG have been studied, from phoneme-based to sentence-based decoding, and for brain activity during speaking, listening, and imagining. For isolated-word speech decoding, Pei et al. used a naive Bayes classifier for speaking and imagining tasks [3], and Martin et al. used a support vector machine for speaking, listening, and imagining tasks [4]. For sentence-based speech decoding, Viterbi decoding with a hidden Markov model was applied by Herff et al. for a speaking task [5] and by Moses et al. for a listening task [6, 7].
With the advent of deep-learning techniques, recurrent neural networks (RNNs) have been applied to decoding speech from ECoG signals. Sun et al. used a combination of a long short-term memory (LSTM) RNN model [8] and a connectionist temporal classification decoder [9] for speaking and imagining tasks [10]. Makin et al. successfully applied a “sequence-to-sequence” model, composed of an encoder stage and a decoder stage with bidirectional LSTMs (BLSTMs), to speaking tasks [11]. Moreover, an effective method for training the network with a limited amount of ECoG data has been proposed, wherein an intermediate layer of the network is trained to output speech-latent features of lower dimension than the input features [11, 12, 13]. Sun et al. [10] and Makin et al. [11] trained the LSTM layers in a sequence-to-sequence encoder using mutually synchronized ECoG signals and Mel-frequency cepstral coefficients (MFCCs) as inputs and outputs, respectively.

This work was supported in part by JSPS KAKENHI 20H00235.
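The auxiliary-target scheme described above — training an intermediate encoder layer to regress time-synchronized MFCCs — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the PyTorch framing, layer sizes, and channel counts are all assumptions.

```python
import torch
import torch.nn as nn

class ECoGBLSTMEncoder(nn.Module):
    """BLSTM encoder whose latent states also feed an auxiliary head
    that regresses MFCC frames (hypothetical sizes, for illustration)."""
    def __init__(self, n_channels=64, hidden=128, n_mfcc=13):
        super().__init__()
        self.blstm = nn.LSTM(n_channels, hidden,
                             batch_first=True, bidirectional=True)
        # auxiliary head: predicts the synchronized MFCC frames
        self.mfcc_head = nn.Linear(2 * hidden, n_mfcc)

    def forward(self, ecog):             # ecog: (batch, time, channels)
        latent, _ = self.blstm(ecog)     # latent: (batch, time, 2 * hidden)
        return latent, self.mfcc_head(latent)

enc = ECoGBLSTMEncoder()
ecog = torch.randn(2, 100, 64)           # 2 trials, 100 frames, 64 electrodes
latent, mfcc_pred = enc(ecog)
```

During training, an MSE loss between `mfcc_pred` and the synchronized MFCC targets would be added to the decoder's sequence loss, constraining the latent states toward low-dimensional speech features.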
However, LSTMs are known to have difficulty learning long-range dependencies between input and output sequences. The so-called Transformer model [14], by contrast, has been successfully applied in natural language processing (NLP) [15] and automatic speech recognition (ASR) [16]. We therefore hypothesized that the Transformer would work efficiently in decoding spoken sentences from the ECoG. This paper is the first to report an invasive BMI that decodes spoken sentences from ECoG signals using a Transformer embedded in the encoder stage of a sequence-to-sequence model. The experimental results obtained from seven participants performing the speaking task showed that the proposed model with the Transformer achieved significantly better decoding accuracy than a conventional BLSTM.
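Replacing the recurrent encoder with a Transformer can be sketched as below. This is an illustrative assumption about the architecture, not the configuration used in this work (`d_model`, `nhead`, and the layer count are hypothetical).

```python
import torch
import torch.nn as nn

class ECoGTransformerEncoder(nn.Module):
    """Transformer encoder over ECoG feature frames
    (hypothetical sizes, for illustration)."""
    def __init__(self, n_channels=64, d_model=128, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(n_channels, d_model)  # electrodes -> model dim
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, ecog):                # ecog: (batch, time, channels)
        # self-attention relates every frame to every other frame directly,
        # avoiding the long-range dependency problem of recurrence
        return self.encoder(self.proj(ecog))

enc = ECoGTransformerEncoder()
out = enc(torch.randn(2, 100, 64))
```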
2. METHODS
2.1. Participants
The seven volunteer participants (four males: js1, js5, js6, and js8; three females: js3, js4, and js7) in this study were undergoing treatment for epilepsy at the Department of Neurosurgery, Juntendo University Hospital. ECoG arrays were surgically implanted on each participant’s cortical surface (left hemisphere) to localize their seizure foci. The participants gave written informed consent to participate in this study, which was conducted according to a protocol approved by Juntendo University Hospital and the Tokyo University of Agriculture and Technology.
2.2. Experimental Design
ECoGs were recorded during the speaking task, wherein the participants read aloud sentences displayed on a monitor. Each sentence was in Japanese and consisted of three phrases. Each phrase had two candidates used to generate one sentence, as described in the following: the first phrase was either “watashiwa” (I)¹ or “kimit

¹Japanese pronunciation with the corresponding English translation in parentheses.
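Since each sentence is a sequence of three phrases, the phrase error rate (PER) quoted in the abstract can be computed as the phrase-level Levenshtein distance normalized by the reference length. A minimal sketch follows; the phrases other than “watashiwa” are hypothetical placeholders, not the stimuli used in this study.

```python
def phrase_error_rate(ref, hyp):
    """Edit distance between phrase sequences, divided by len(ref)."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                               # deletions
    for j in range(n + 1):
        d[0][j] = j                               # insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n] / m

ref = ["watashiwa", "ocha-o", "nomitai"]   # hypothetical 3-phrase sentence
hyp = ["watashiwa", "mizu-o", "nomitai"]
print(phrase_error_rate(ref, hyp))         # one substituted phrase -> 1/3
```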
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-6654-0540-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICASSP43922.2022.9747443