NON-AUTOREGRESSIVE END-TO-END AUTOMATIC SPEECH RECOGNITION
INCORPORATING DOWNSTREAM NATURAL LANGUAGE PROCESSING
Motoi Omachi¹, Yuya Fujita¹, Shinji Watanabe², Tianzi Wang³
¹Yahoo Japan Corporation, ²Carnegie Mellon University, ³Johns Hopkins University
ABSTRACT
We propose a fast and accurate end-to-end (E2E) model that executes automatic speech recognition (ASR) and downstream natural language processing (NLP) simultaneously. The proposed approach predicts a single-aligned sequence of transcriptions and linguistic annotations, such as part-of-speech (POS) tags and named entity (NE) tags, from speech. We use non-autoregressive (NAR) decoding instead of autoregressive (AR) decoding to reduce execution time, since NAR decoding can output multiple tokens in parallel across time. We use the connectionist temporal classification (CTC) model with mask-predict, i.e., Mask-CTC, to predict the single-aligned sequence accurately. Mask-CTC improves performance by jointly training CTC and a conditional masked language model, and by refining low-confidence output tokens conditioned on reliable output tokens and audio embeddings. The proposed method jointly performs ASR and a downstream NLP task, i.e., POS or NE tagging, in a NAR manner. Experiments using the Corpus of Spontaneous Japanese and the Spoken Language Understanding Resource Package show that the proposed E2E model predicts transcriptions and linguistic annotations with consistently better performance than vanilla CTC with greedy decoding, and 15–97× faster than a Transformer-based AR model.
Index Terms— Speech recognition, natural language processing, linguistic annotation, end-to-end, non-autoregressive
1. INTRODUCTION
The end-to-end (E2E) model, which predicts an output sequence from an input feature sequence with a single neural network (NN), has been frequently used in automatic speech recognition (ASR) [1–7]. In ASR, the E2E model predicts graphemic/multi-graphemic sequences and generates transcriptions. However, in spoken language understanding, not only transcriptions but also linguistic annotations such as phonemes and part-of-speech (POS) tags are helpful [8].
Recently proposed E2E models predict transcriptions and such linguistic annotations jointly [9–12]. For example, an E2E model [12] predicts transcriptions and phonemic sequences using one-to-many sequence mapping. However, it requires additional alignment post-processing to obtain the correspondence between the phonemic and graphemic sequences. As another example, the E2E models of [9–11] output a serialized sequence consisting of graphemic/multi-graphemic units followed by named entity (NE) tags, phonemes, or other linguistic annotations, as shown in Figure 1. These models do not require additional alignment post-processing.
In these studies, connectionist temporal classification (CTC) is frequently used. However, CTC is inadequate for predicting transcriptions and additional information due to its strong conditional independence assumption: CTC ignores explicit relations among output tokens even though these tokens are related to each other.
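The conditional-independence assumption is visible in how CTC output is decoded. The following sketch (illustrative only, not the paper's implementation) shows greedy CTC decoding: each frame's label is an independent per-frame argmax, so no output token ever conditions on another output token.

```python
BLANK = 0  # conventional CTC blank index

def ctc_greedy_decode(logits):
    """logits: per-frame score lists, shape [T][V]."""
    # 1) frame-wise independent argmax -- no token-to-token dependency
    frame_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2) collapse repeated labels, then 3) drop blanks
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return out

# toy example: 5 frames over a vocabulary {0: blank, 1: 'a', 2: 'b'}
logits = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.7, 0.2],    # 'a' (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.1, 0.7],    # 'b'
    [0.2, 0.1, 0.7],    # 'b' (repeat, collapsed)
]
print(ctc_greedy_decode(logits))  # -> [1, 2]
```

Because each frame is decided in isolation, an error in one position cannot be repaired by the surrounding tokens, which is exactly the weakness discussed above.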
We recently proposed a Transformer-based E2E model instead of the CTC model for predicting an aligned sequence of graphemic units, phonemes, and POS tags [13]. The Transformer is well suited to predicting sequences whose output tokens are related to each other because it does not make any conditional independence assumption. However, the Transformer leaves room for improvement in execution time: it requires at least N iterations to predict an N-length output. Since a serialized sequence comprising multiple types of tokens tends to be long, the Transformer's execution time increases accordingly. This paper aims to develop an E2E model that predicts transcriptions and linguistic annotations simultaneously, faster than the Transformer and with sufficient performance.
The E2E models used in ASR can be categorized into autoregressive (AR) [2–7] and non-autoregressive (NAR) [1, 14–19] models, and there is a trade-off between ASR performance and execution time. AR models, such as the Transformer, achieve better ASR performance than NAR models at the expense of higher execution time. On the other hand, NAR models, such as CTC with greedy decoding, predict the output sequence faster than AR models thanks to parallel computation across time. However, NAR performance, especially that of CTC, is worse than AR performance due to the lack of explicit output token dependency. To improve the performance, Mask-CTC, which applies mask-predict [14, 20] to the CTC output, was proposed [19]. Mask-CTC refines unreliable output tokens conditioned on reliable output tokens and audio embeddings using a decoder, which other NAR models [16–18] do not use.
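The Mask-CTC refinement loop can be summarized as: mask low-confidence CTC outputs, then let the decoder fill the masks over a few parallel passes, easiest slots first. The sketch below is a minimal, hedged illustration of that idea; `cmlm` is a hypothetical stand-in for the trained conditional masked language model, not an actual API of the paper's system.

```python
MASK = "<mask>"

def mask_ctc_refine(ctc_tokens, ctc_confidences, cmlm, threshold=0.9, iterations=2):
    # 1) mask CTC outputs whose confidence falls below the threshold
    tokens = [t if c >= threshold else MASK
              for t, c in zip(ctc_tokens, ctc_confidences)]
    total = tokens.count(MASK)
    # 2) iteratively fill masks, revealing the highest-scoring slots first
    for _ in range(iterations):
        slots = [i for i, t in enumerate(tokens) if t == MASK]
        if not slots:
            break
        preds = cmlm(tokens)  # assumed to return {index: (token, score)} for masked slots
        n_fill = max(1, total // iterations)  # gradual reveal schedule
        for i in sorted(slots, key=lambda i: -preds[i][1])[:n_fill]:
            tokens[i] = preds[i][0]
    return tokens

# toy CMLM that always predicts 'x' with full confidence (illustrative only)
def toy_cmlm(tokens):
    return {i: ("x", 1.0) for i, t in enumerate(tokens) if t == MASK}

print(mask_ctc_refine(["a", "b", "c"], [0.95, 0.5, 0.99], toy_cmlm))  # -> ['a', 'x', 'c']
```

The key contrast with greedy CTC is that each filled token is conditioned on the reliable tokens around it, so output-token dependencies are modeled at inference time.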
This paper proposes using Mask-CTC [19] to predict transcriptions and linguistic annotations simultaneously instead of using an AR model such as the Transformer. We expect the Mask-CTC decoder, i.e., the conditional masked language model (CMLM), to capture explicit output token dependency. This makes it suitable for predicting sequences whose output tokens are related to each other, e.g., the single-aligned sequence of graphemes and linguistic annotations used in this study.
This study is the first attempt to use Mask-CTC for predicting sequences comprising multiple token types in a serialized format. We expect the NAR model to achieve faster prediction than a Transformer-based AR model because it can predict multiple tokens in parallel across time. Also, ASR performance improves compared to vanilla CTC with greedy decoding, since mask-predict refines CTC output errors by considering conditional dependence. Since the target sequence comprises multiple token types (e.g., graphemes, phonemes, and POS tags), we propose a type-wise mask-predict algorithm that iteratively updates one of the token types (e.g., graphemes only) in each iteration.
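The type-wise idea can be sketched as follows: the serialized sequence interleaves several token types, and each refinement pass fills masked slots of one type only. This is a hedged illustration of the scheduling described above, not the paper's implementation; `predict`, the `types` annotation, and the type order are all assumptions introduced for the sketch.

```python
MASK = "<mask>"

def typewise_mask_predict(tokens, types, predict,
                          type_order=("grapheme", "phoneme", "pos")):
    """tokens: serialized sequence containing MASK entries.
    types:  parallel list giving each position's token type.
    predict(tokens, i) -> filled token (hypothetical CMLM stand-in).
    Each outer pass updates masked slots of a single token type only."""
    for ttype in type_order:  # one token type per iteration
        for i, ty in enumerate(types):
            if tokens[i] == MASK and ty == ttype:
                tokens[i] = predict(tokens, i)
    return tokens

# toy demo: fill a masked grapheme first, then a masked POS tag
types = ["grapheme", "pos", "grapheme", "pos"]
tokens = [MASK, MASK, "day", "<noun>"]
reference = ["good", "<adj>", "day", "<noun>"]  # oracle fill, for illustration
predict = lambda toks, i: reference[i]
print(typewise_mask_predict(tokens, types, predict))  # -> ['good', '<adj>', 'day', '<noun>']
```

Updating one type at a time lets, e.g., already-revealed graphemes condition the later POS-tag pass instead of all types being guessed at once.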
Experimental results show that our mask-predict-based NAR model outperforms vanilla CTC with greedy decoding in ASR and is faster than the Transformer with AR decoding. In addition, we found that our model works adequately in downstream natural language processing (NLP) tasks, i.e., POS tagging and NE tagging.
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 978-1-6654-0540-9/22/$31.00 © 2022 IEEE. DOI: 10.1109/ICASSP43922.2022.9746067