Forget a Bit to Learn Better: Soft Forgetting for CTC-based Automatic Speech Recognition

Kartik Audhkhasi, George Saon, Zoltán Tüske, Brian Kingsbury, Michael Picheny
IBM Research AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
{kaudhkha,gsaon,zoltan.tuske,bedk,picheny}@us.ibm.com

Abstract

Prior work has shown that connectionist temporal classification (CTC)-based automatic speech recognition systems perform well when using bidirectional long short-term memory (BLSTM) networks unrolled over the whole speech utterance. This is because whole-utterance BLSTMs better capture long-term context. We hypothesize that this also leads to overfitting and propose soft forgetting as a solution. During training, we unroll the BLSTM network only over small non-overlapping chunks of the input utterance. We randomly pick a chunk size for each batch instead of a fixed global chunk size. In order to retain some utterance-level information, we encourage the hidden states of the BLSTM network to approximate those of a pre-trained whole-utterance BLSTM. Our experiments on the 300-hour English Switchboard dataset show that soft forgetting improves the word error rate (WER) above a competitive whole-utterance phone CTC BLSTM by an average of 7-9% relative. We obtain WERs of 9.1%/17.4% using speaker-independent and 8.7%/16.8% using speaker-adapted models respectively on the Hub5-2000 Switchboard/CallHome test sets. We also show that soft forgetting improves the WER when the model is used with limited temporal context for streaming recognition. Finally, we present some empirical insights into the regularization and data augmentation effects of soft forgetting.

Index Terms: automatic speech recognition, connectionist temporal classification, regularization

1. Introduction

End-to-end (E2E) automatic speech recognition (ASR) systems [1–20] have been the subject of significant recent research.
Such systems aim to simplify the complex training and inference pipelines of conventional hybrid ASR systems [21, 22]. Hybrid systems combine Gaussian mixture models, hidden Markov models (HMMs), and various neural networks, and involve multiple stages of model building and alignment between the sequence of speech features and HMM context-dependent states. In contrast, end-to-end systems use recurrent neural networks (RNNs) and train the acoustic model in one shot, either by summing over all alignments through the connectionist temporal classification (CTC) loss [1] or by learning the optimal alignment through an attention mechanism [14, 16]. As several works have shown, the word error rate (WER) gap between E2E and hybrid ASR systems has narrowed over time.

RNNs with long short-term memory (LSTM) [23] hidden units are the neural networks of choice for ASR systems. Bidirectional LSTM (BLSTM) networks are especially popular and consist of two LSTM networks at each layer that are unrolled forward and backward in time. For E2E ASR systems, the BLSTM network is unrolled over the entire length of the speech utterance. Whole-utterance unrolling enables a BLSTM network to better capture long-term context, which is especially useful given the lack of alignments. Control over remembering long-term context is left to the four trainable gates (input, forget, cell, and output) of the LSTM cell. Prior work has explored variants of the LSTM to control this behavior [24–26].

We hypothesize that whole-utterance unrolling of the BLSTM network leads to overfitting, even in the presence of well-known regularization techniques such as dropout [27]. This is especially detrimental to the WER of E2E ASR systems given limited training data, e.g. a few hundred hours of speech. We propose soft forgetting to combat this overfitting. First, we unroll the BLSTM network only over small non-overlapping chunks of the input acoustic utterance instead of the whole utterance.
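The chunking described above, with a chunk size drawn once per batch as in the abstract, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; the size bounds and utterance length are hypothetical.

```python
import random

def split_into_chunks(frames, min_size, max_size, rng):
    """Split one utterance (a list of frame features) into non-overlapping
    chunks. A single chunk size is drawn per call, mirroring the per-batch
    random chunk size used during training. The size bounds here are
    illustrative choices, not values from the paper."""
    size = rng.randint(min_size, max_size)  # inclusive bounds
    chunks = [frames[i:i + size] for i in range(0, len(frames), size)]
    return chunks, size

rng = random.Random(7)
utterance = [[float(t)] for t in range(137)]  # stand-in for 137 acoustic frames
chunks, size = split_into_chunks(utterance, min_size=30, max_size=60, rng=rng)

# The chunks tile the utterance exactly: no frames are lost or duplicated.
assert [f for c in chunks for f in c] == utterance
assert all(len(c) == size for c in chunks[:-1])  # only the last chunk may be shorter
```

The BLSTM would then be unrolled over each chunk independently, so no gradient or state flows across chunk boundaries.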
The hidden and cell states of the forward and backward LSTM networks are reset to zero at chunk boundaries. In order to prevent memorization of a fixed-size chunk, we randomly perturb the chunk size across batches during training. Finally, we use twin regularization [29, 30] in order to retain some utterance-level context. Twin regularization adds the mean-squared error between the hidden states of the chunk-based BLSTM network and a pre-trained whole-utterance BLSTM network to the CTC loss. Since twin regularization promotes some remembering of context across chunks, we call our approach soft forgetting.

To the best of our knowledge, prior works have considered chunked training of CTC ASR models primarily for streaming inference. For example, [31] gives an overview of temporal chunking, latency-controlled bidirectional RNNs [32], and lookahead convolutions [33]. However, soft forgetting additionally incorporates chunk jitter and twin regularization, and significantly improves both the offline/non-streaming and streaming WER of the CTC ASR system. We conduct experiments on the 300-hour English Switchboard data set and show that soft forgetting significantly improves the WER by 7-9% relative over a competitive phone CTC baseline across several test sets. We also present empirical evidence for the regularization and data augmentation effects of soft forgetting.

2. Soft Forgetting

Before discussing soft forgetting, we first give a brief overview of twin regularization [29, 30] for its original application of closing the WER gap between uni-directional LSTM (ULSTM) and BLSTM networks. ULSTM networks lag BLSTM networks in terms of ASR WER because ULSTM networks only incorporate forward-in-time context, whereas BLSTM networks additionally incorporate backward-in-time context. In its original formulation, twin regularization jointly trains two ULSTM networks operating forward and backward in time independently.
The overall training loss is

    L_tot(y | x, Θ_f, Θ_b) = L_CTC-f(y | x, Θ_f) + L_CTC-b(y | x, Θ_b) + λ L_twin(h_f, h_b | Θ_f, Θ_b)    (1)

INTERSPEECH 2019, September 15–19, 2019, Graz, Austria. Copyright 2019 ISCA. http://dx.doi.org/10.21437/Interspeech.2019-2841
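As a numeric sanity check, the structure of equation (1) can be sketched in plain Python, with a mean-squared-error twin term over the two networks' hidden states. The scalar CTC losses, hidden-state values, and weight λ below are toy numbers for illustration only; real values would come from the trained networks.

```python
def mse(h_f, h_b):
    """Mean-squared error between two equal-length hidden-state vectors."""
    assert len(h_f) == len(h_b)
    return sum((a - b) ** 2 for a, b in zip(h_f, h_b)) / len(h_f)

def total_loss(ctc_f, ctc_b, h_f, h_b, lam):
    """Eq. (1): L_tot = L_CTC-f + L_CTC-b + λ · L_twin(h_f, h_b).
    ctc_f and ctc_b are the scalar CTC losses of the forward and backward
    ULSTM networks; lam is the twin-regularization weight λ."""
    return ctc_f + ctc_b + lam * mse(h_f, h_b)

# Toy values: two CTC losses plus a small twin penalty tying h_f to h_b.
loss = total_loss(ctc_f=1.2, ctc_b=1.5, h_f=[0.5, 0.1], h_b=[0.4, 0.3], lam=0.1)
assert abs(loss - (1.2 + 1.5 + 0.1 * 0.025)) < 1e-9
```

With lam = 0 the two networks train independently; the twin term is what couples their hidden states together.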