TRANSFORMER-BASED ESTIMATION OF SPOKEN SENTENCES USING ELECTROCORTICOGRAPHY

Shuji Komeiji 1, Kai Shigemi 1, Takumi Mitsuhashi 2, Yasushi Iimura 2, Hiroharu Suzuki 2, Hidenori Sugano 2, Koichi Shinoda 3, and Toshihisa Tanaka 1

1 Department of Electronic and Information Engineering, Tokyo University of Agriculture and Technology
2 Department of Neurosurgery, Juntendo University School of Medicine
3 Department of Computer Science, Tokyo Institute of Technology

ABSTRACT

Invasive brain–machine interfaces (BMIs) are a promising neurotechnological venture for achieving direct speech communication from a human brain, but they face many challenges. In this paper, we measured invasive electrocorticogram (ECoG) signals from seven participating epilepsy patients as they spoke sentences consisting of multiple phrases. A Transformer encoder was incorporated into a "sequence-to-sequence" model to decode spoken sentences from the ECoG. The decoding test revealed that the Transformer model achieved a minimum phrase error rate (PER) of 16.4%, and the median (± standard deviation) across the seven participants was 31.3% (±10.0%). Moreover, the proposed model with the Transformer achieved significantly better decoding accuracy than a conventional long short-term memory model.

Index Terms— Electrocorticogram (ECoG), Brain–machine interface (BMI), Transformer encoder, Sequence to sequence

1. INTRODUCTION

Brain–machine interfacing (BMI), which enables speech to be decoded from human thought, is expected to be used not only by aphasic patients but also as a new communication tool in the future [1]. Several techniques to achieve such BMI, using an invasive electrocorticogram (ECoG) measured by electrodes implanted in the skull, are under development.
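For reference, the phrase error rate (PER) reported in the abstract is, by the usual speech-recognition convention, the edit (Levenshtein) distance between the decoded and reference phrase sequences divided by the number of reference phrases. The function names below are ours, not the paper's; a minimal sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phrase sequences (lists of strings)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete i phrases
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert j phrases
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1]

def phrase_error_rate(ref, hyp):
    """PER = phrase-level edit distance / number of reference phrases."""
    return edit_distance(ref, hyp) / len(ref)

# One wrong phrase out of three gives a PER of 1/3.
print(phrase_error_rate(["A", "B", "C"], ["A", "X", "C"]))  # 0.3333333333333333
```

Under this definition a PER of 16.4% means roughly one phrase in six was decoded incorrectly.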
The ECoG is superior to surface electroencephalography in terms of spatial and temporal resolution and signal-to-noise ratio; it is particularly suitable for analyzing brain activity related to speech in the high gamma band [2, 3].

A variety of forms of speech decoding using ECoG have been studied, from phoneme-based to sentence-based decoding, and for speaking, listening, and imagining brain activities. For isolated-word speech decoding, Pei et al. used a naive Bayes classifier for speaking and imagining tasks [3], and Martin et al. used a support vector machine for speaking, listening, and imagining tasks [4]. For sentence-based speech decoding, Viterbi decoding with a hidden Markov model was applied by Herff et al. for a speaking task [5] and by Moses et al. for a listening task [6, 7].

With the advent of deep-learning techniques, recurrent neural networks (RNNs) have been applied to decoding speech from ECoG signals. Sun et al. used a combination of a long short-term memory (LSTM) RNN model [8] and a connectionist temporal classification decoder [9] for speaking and imagining tasks [10]. Makin et al. successfully applied a "sequence-to-sequence" model, composed of an encoder stage and a decoder stage with bidirectional LSTMs (BLSTMs), for speaking tasks [11]. Moreover, an effective method for training such a network with a limited amount of ECoG data has been proposed, wherein an intermediate layer of the network is trained to output speech latent features of lower dimensionality than the input features [11, 12, 13]. Sun et al. [10] and Makin et al. [11] trained the LSTM layers in a sequence-to-sequence encoder using mutually synchronized ECoG signals and Mel-frequency cepstral coefficients (MFCCs) as inputs and outputs, respectively. However, LSTMs are generally known to have difficulty learning long-range dependencies between input and output sequences.

This work was supported in part by JSPS KAKENHI 20H00235.
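The mechanism behind the Transformer's advantage here is self-attention: every output frame attends to every input frame directly, rather than through a recurrent state, which is what makes long-range dependencies easier to capture than with an LSTM. A minimal NumPy sketch of single-head scaled dot-product attention (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return softmax(Q K^T / sqrt(d_k)) V for 2-D query/key/value matrices."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k) frame similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (T_q, d_v) weighted average

# Toy example: 5 feature frames of dimension 8 attending to themselves.
X = np.random.default_rng(0).standard_normal((5, 8))
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (5, 8)
```

Each output row is a convex combination of all value rows, so the distance between two frames in the sequence has no effect on how directly they can interact.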
To address this limitation, the so-called Transformer model [14] has been successfully applied in natural language processing (NLP) [15] and automatic speech recognition (ASR) [16]. We therefore hypothesize that the Transformer works efficiently in decoding spoken sentences from the ECoG. This paper is the first to report an invasive BMI that decodes speech using a Transformer embedded in the encoder stage of a sequence-to-sequence model to decode spoken sentences from ECoG signals. The experimental results obtained from seven participants performing the speaking task showed that the proposed model with the Transformer achieved significantly better decoding accuracy than a conventional BLSTM.

2. METHODS

2.1. Participants

The seven volunteer participants (four males: js1, js5, js6, and js8; three females: js3, js4, and js7) in this study were undergoing treatment for epilepsy at the Department of Neurosurgery, Juntendo University Hospital. ECoG arrays were surgically implanted on each participant's cortical surface (left hemisphere) to localize their seizure foci. The participants gave written informed consent to participate in this study, which was executed according to a protocol approved by Juntendo University Hospital and the Tokyo University of Agriculture and Technology.

2.2. Experimental Design

ECoGs were recorded during the speaking task, wherein the participants read aloud sentences displayed on a monitor. Each sentence was in Japanese and consisted of three phrases. Each phrase had two candidates used to generate one sentence, as described in the following: The first phrase was either "watashiwa" (I)¹ or "kimit

¹ Japanese pronunciation with the corresponding English translation in parentheses.

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-6654-0540-9/22/$31.00 ©2022 IEEE | DOI: 10.1109/ICASSP43922.2022.9747443
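With two candidates per phrase and three phrase slots, the design yields 2 × 2 × 2 = 8 distinct sentences, assuming the slots combine freely. Only "watashiwa" is given explicitly in the text above, so the remaining candidate strings in this sketch are placeholders, not the actual stimuli:

```python
from itertools import product

# Candidate phrases per slot. Only "watashiwa" (I) appears in the text above;
# every other entry is an illustrative placeholder, not an actual stimulus.
candidates = [
    ["watashiwa", "<first-2>"],    # first-phrase candidates
    ["<second-1>", "<second-2>"],  # second-phrase candidates
    ["<third-1>", "<third-2>"],    # third-phrase candidates
]

# Cartesian product over the three slots enumerates every sentence.
sentences = [" ".join(phrases) for phrases in product(*candidates)]
print(len(sentences))  # 8
```

A small closed vocabulary of this kind keeps the decoding task tractable for the limited amount of ECoG data available per participant.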