NON-AUTOREGRESSIVE END-TO-END AUTOMATIC SPEECH RECOGNITION INCORPORATING DOWNSTREAM NATURAL LANGUAGE PROCESSING

Motoi Omachi 1, Yuya Fujita 1, Shinji Watanabe 2, Tianzi Wang 3
1 Yahoo Japan Corporation, 2 Carnegie Mellon University, 3 Johns Hopkins University

ABSTRACT

We propose a fast and accurate end-to-end (E2E) model that executes automatic speech recognition (ASR) and downstream natural language processing (NLP) simultaneously. The proposed approach predicts a single aligned sequence of transcriptions and linguistic annotations, such as part-of-speech (POS) tags and named entity (NE) tags, from speech. We use non-autoregressive (NAR) decoding instead of autoregressive (AR) decoding to reduce execution time, since NAR decoding can output multiple tokens in parallel across time. We use the connectionist temporal classification (CTC) model with mask-predict, i.e., Mask-CTC, to predict the single aligned sequence accurately. Mask-CTC improves performance by jointly training CTC and a conditional masked language model, and by refining low-confidence output tokens conditioned on reliable output tokens and audio embeddings. The proposed method jointly performs ASR and a downstream NLP task, i.e., POS or NE tagging, in a NAR manner. Experiments using the Corpus of Spontaneous Japanese and the Spoken Language Understanding Resource Package show that the proposed E2E model predicts transcriptions and linguistic annotations with consistently better performance than vanilla CTC with greedy decoding, and is 15–97x faster than a Transformer-based AR model.

Index Terms: Speech recognition, natural language processing, linguistic annotation, end-to-end, non-autoregressive

1. INTRODUCTION

The end-to-end (E2E) model, which predicts an output sequence from an input feature sequence with a single neural network (NN), has been frequently used in automatic speech recognition (ASR) [1–7].
In ASR, the E2E model predicts grapheme or multi-grapheme sequences and generates transcriptions. However, in spoken language understanding, not only transcriptions but also linguistic annotations such as phonemes and part-of-speech (POS) tags are helpful [8]. Recently proposed E2E models predict transcriptions and such linguistic annotations jointly [9–12]. For example, the E2E model of [12] predicts transcriptions and phonemic sequences using one-to-many sequence mapping. However, it requires additional alignment post-processing to obtain the correspondence between the phonemic and graphemic sequences. As another example, the E2E models of [9–11] output a serialized sequence consisting of graphemic/multi-graphemic units followed by named entity (NE) tags, phonemes, or other linguistic annotations, as shown in Figure 1. These models do not require additional alignment post-processing. In these studies, connectionist temporal classification (CTC) is frequently used. However, CTC is inadequate for predicting transcriptions together with additional information because of its strong conditional independence assumption: CTC ignores explicit relations among output tokens even though these tokens are related to each other. We recently proposed a Transformer-based E2E model instead of a CTC model for predicting an aligned sequence of graphemic units, phonemes, and POS tags [13]. Transformer is well suited to predicting sequences whose output tokens are related to each other because it does not make any conditional independence assumption. However, Transformer leaves room for improvement in execution time: it requires at least N iterations to predict an N-length output, and since a serialized sequence comprising multiple types of tokens tends to be long, Transformer's execution time increases accordingly. This paper aims to develop an E2E model that predicts transcriptions and linguistic annotations simultaneously, faster than Transformer and with sufficient performance.
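To make the serialized target format concrete, the following is a minimal illustrative sketch (not the authors' code; the tag names and bracket convention are assumptions) of how word tokens can be interleaved with their POS tags into one aligned target sequence:

```python
def serialize(words, pos_tags):
    """Interleave each word token with its annotation tag, producing the
    single aligned target sequence described in the text. Annotation tokens
    are wrapped in angle brackets here purely as an illustrative convention."""
    assert len(words) == len(pos_tags)
    seq = []
    for word, tag in zip(words, pos_tags):
        seq.append(word)
        seq.append(f"<{tag}>")  # tag drawn from a closed annotation vocabulary
    return seq

# An English stand-in for the Japanese data used in the paper:
print(serialize(["I", "run"], ["PRON", "VERB"]))
# -> ['I', '<PRON>', 'run', '<VERB>']
```

Because the layout is fixed (token, tag, token, tag, ...), no alignment post-processing is needed to recover which tag belongs to which token.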
The E2E models used in ASR can be categorized into autoregressive (AR) [2–7] and non-autoregressive (NAR) [1, 14–19] models, and there are trade-offs between ASR performance and execution time. AR models, such as Transformer, achieve better ASR performance than NAR models at the expense of higher execution time. On the other hand, NAR models, such as CTC with greedy decoding, predict the output sequence faster than AR models thanks to parallel computation across time. However, NAR performance, especially that of CTC, is worse than AR performance due to the lack of explicit output token dependency. To improve performance, Mask-CTC, which applies mask-predict [14, 20] to the CTC output, has been proposed [19]. Mask-CTC refines unreliable output tokens conditioned on reliable output tokens and audio embeddings using a decoder, which other NAR models [16–18] do not use. This paper proposes using Mask-CTC [19] to simultaneously predict transcriptions and linguistic annotations instead of using an AR model such as Transformer. We expect the Mask-CTC decoder, i.e., the conditional masked language model (CMLM), to capture explicit output token dependency. This makes it suitable for predicting sequences whose output tokens are related to each other, e.g., the single aligned sequence of graphemes and linguistic annotations used in this study. This study is the first attempt to use Mask-CTC for predicting sequences comprising multiple token types in a serialized format. We expect the NAR model to achieve faster prediction than a Transformer-based AR model because it can predict multiple tokens in parallel across time. Also, ASR performance improves compared to vanilla CTC with greedy decoding, since mask-predict refines CTC output errors by considering the conditional dependence among tokens.
Since the target sequence comprises multiple token types (e.g., graphemes, phonemes, and POS tags), we propose a type-wise mask-predict algorithm that iteratively updates one of the token types (e.g., graphemes only) in each iteration. Experimental results show that our mask-predict-based NAR model outperforms vanilla CTC with greedy decoding in ASR and is faster than Transformer with AR decoding. In addition, we found that our model works adequately on downstream natural language processing (NLP) tasks, i.e., POS tagging and NE tagging.

978-1-6654-0540-9/22/$31.00 ©2022 IEEE | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | DOI: 10.1109/ICASSP43922.2022.9746067
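The type-wise variant can be sketched in the same style. Again this is an illustrative approximation, not the paper's implementation: `types` encodes the token type at each position (known from the fixed serialized layout), and `toy_cmlm` is a hypothetical stand-in for the trained CMLM.

```python
MASK = "<mask>"

def typewise_mask_predict(seq, cmlm, types, order=("grapheme", "tag")):
    """Sketch of type-wise mask-predict: each pass re-predicts only the
    masked positions of one token type (e.g., all graphemes first, then all
    annotation tags), so tokens of one type condition on the already
    committed tokens of the other types."""
    seq = list(seq)
    for token_type in order:
        masked = [i for i, tok in enumerate(seq)
                  if tok == MASK and types[i] == token_type]
        if not masked:
            continue
        preds = cmlm(seq)  # {position: token} for masked slots
        for i in masked:   # commit only this type's predictions this pass
            seq[i] = preds[i]
    return seq

def toy_cmlm(seq):
    # Hypothetical stand-in for the trained CMLM.
    answers = {2: "cat", 3: "<NOUN>"}
    return {i: answers[i] for i, tok in enumerate(seq) if tok == MASK}

print(typewise_mask_predict(
    ["the", "<DET>", MASK, MASK], toy_cmlm,
    types=["grapheme", "tag", "grapheme", "tag"]))
# -> ['the', '<DET>', 'cat', '<NOUN>']
```

Updating one type per iteration means the POS tag at position 3 is predicted only after the grapheme it annotates has been committed, rather than both being filled independently in a single pass.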