Online Incremental Learning for Speaker-Adaptive Language Models

Chih Chi Hu, Bing Liu, John Paul Shen, Ian Lane
Electrical and Computer Engineering, Carnegie Mellon University, USA
{chihhu,liubing,jpshen,lane}@cmu.edu

Abstract

Voice control is a prominent interaction method on personal computing devices. While automatic speech recognition (ASR) systems are readily applicable to large audiences, there is room for further adaptation at the edge, i.e., locally on devices, targeted at individual users. In this work, we explore improving ASR systems over time through a user's own interactions. Our online learning approach for speaker-adaptive language modeling leverages a user's most recent utterances to enhance speaker-dependent features and traits. We experiment with the Large-Vocabulary Continuous Speech Recognition corpus Tedlium v2, and demonstrate an average reduction in perplexity (PPL) of 19.18% and an average relative reduction in word error rate (WER) of 2.80% compared to a state-of-the-art baseline on Tedlium v2.

Index Terms: Automatic Speech Recognition, Online Learning, Language Modeling, Speaker Adaptation, Speaker-Specific Modeling, Recurrent Neural Networks

1. Introduction

Voice control is becoming an increasingly popular interaction method on personal computing devices, where interactions are limited to a single user or a handful of users. Phones, laptops, and even vehicles now support services that provide personalized recommendations and advertisements according to a user's interests and needs. Speech recognition is one such area that can be leveraged to build user profiles and, through real-time speaker adaptation, provide further enhancements based on user-specific phrases, usage, and style. Voice assistants such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Alexa could provide a better interactive experience for all users if they could learn from their users through interactions and their own hypotheses (or references, if available).
We focus on adapting to the syntactic, semantic, and pragmatic characteristics of speech, which by design are captured by the language model. By leveraging the user's most recent utterances, we enhance speaker-dependent features and traits in the recurrent neural network language model, and implicitly capture context, to improve speech recognition for the designated user(s) over time.

Prior work in speaker adaptation has broadly explored fine-tuning or freezing various parameters or components of the ASR model. We take a simple approach of continuous mini-batch training with varying epochs and batch sizes. With a small number of epochs, a standard training strategy, and continuous online learning, our experiments show consistent improvements for individual speakers over time.

In this paper, we use online incremental learning for language model speaker adaptation to improve performance and enhance the robustness of automatic speech recognition systems. We train a state-of-the-art RNN language model; then, during live inference, we re-train on mini-batches of utterances in an incremental and continuous fashion in real time. After each segment, an updated model is produced and evaluated on the next segment. We use both ASR hypotheses and reference utterances for mini-batch training to explore the effectiveness of online incremental learning. With online learning on reference utterances, our RNN model obtained an average reduction of 20.09% in PPL (Figure 2) and a reduction of up to 0.75% in absolute WER (Figure 1), corresponding to a relative WER reduction of 9.93%. Furthermore, online learning with references yields an additional 0.98% relative improvement in average PPL and 0.10% absolute improvement in average WER over online learning with ASR hypotheses.

Our main contribution is to show that through online incremental learning, ASR systems can adapt to users over time, with significant improvements in PPL along with reductions in WER.
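The evaluate-then-adapt loop can be illustrated with a toy sketch. Here a smoothed unigram count model stands in for the RNN language model (the `UnigramLM` class and its `update` interface are hypothetical simplifications for illustration, not the paper's training code): each segment is scored with the current model before the model is re-trained on it, mirroring the incremental schedule described above.

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one-smoothed unigram model standing in for the RNN LM."""
    def __init__(self, vocab):
        self.counts = Counter({w: 1 for w in vocab})  # add-one smoothing
        self.total = len(vocab)

    def perplexity(self, words):
        log_prob = sum(math.log(self.counts[w] / self.total) for w in words)
        return math.exp(-log_prob / len(words))

    def update(self, words):
        # Stand-in for mini-batch re-training: accumulate new evidence.
        self.counts.update(words)
        self.total += len(words)

vocab = ["the", "model", "learns", "speaker", "style", "over", "time"]
# A speaker who reuses the same phrasing across segments.
segments = [["the", "speaker", "style"]] * 3

lm = UnigramLM(vocab)
ppls = []
for seg in segments:
    ppls.append(lm.perplexity(seg))  # evaluate segment with the current model
    lm.update(seg)                   # then adapt on it (hypotheses or references)

assert ppls[0] > ppls[1] > ppls[2]  # adaptation lowers PPL on later segments
```

Because each segment is scored before the model sees it, the decreasing perplexities reflect genuine held-out gains from adaptation, not memorization of the evaluation text.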
This improvement is consistent for online incremental learning on both ASR hypotheses and references across three state-of-the-art baseline acoustic models.

The paper is organized as follows. In Section 2 we introduce relevant work on speaker-adaptive acoustic modeling and language modeling. We then describe the baseline language model and our online incremental learning methodology in Section 3. In Section 4, we discuss the experimental setup, results, and findings of online incremental learning applied to speech recognition systems. Finally, in Section 5 we conclude our work and discuss future ideas.

2. Related Work

Language model adaptation has been widely studied in the literature. Hsu et al. [1] explored the iterative use of ASR hypotheses for unsupervised parameter estimation of n-gram language models. Similarly, [2, 3, 4] proposed unsupervised adaptation methods for presentation lecture speech recognition. Recent work on RNNLM adaptation [5, 6] explored using utterance topic information extracted with Latent Dirichlet Allocation and Hierarchical Dirichlet Processes to adapt language models for multi-genre broadcast transcription tasks. These works reported significant perplexity reductions from RNNLM adaptation, and small (0.1-0.2%) reductions in WER. Deena et al. [7] studied RNNLM adaptation with combined feature- and model-based adaptation. Gangireddy et al. [8] explored model-based adaptation by scaling forward-propagated hidden activations and by directly fine-tuning all RNNLM parameters. Ma et al. [9] proposed adapting the softmax layer of the neural network and showed improved performance on both perplexity and WER.

Another line of research closely related to our work is the cache language model [10, 11]. A cache language model stores a representation of recent text and uses it for next-word prediction. Jelinek et al. [12] proposed a cache trigram language model using trigram frequencies estimated from the recent history of words.
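The cache idea can be sketched as interpolating a recency-based cache distribution with the base model's prediction. This is a minimal unigram sketch, not the exact formulation of [10, 11]; the mixing weight `lam` and the function name are illustrative assumptions.

```python
from collections import Counter

def cache_lm_prob(word, base_probs, history, lam=0.2):
    """Mix a base LM with a unigram cache built from recent words:
    p(w) = (1 - lam) * p_base(w) + lam * p_cache(w)
    """
    cache = Counter(history)
    p_cache = cache[word] / len(history) if history else 0.0
    return (1 - lam) * base_probs.get(word, 0.0) + lam * p_cache

base = {"the": 0.5, "speaker": 0.25, "adapts": 0.25}
history = ["speaker", "speaker", "adapts"]  # recent text seen by the cache

# A recently frequent word gets a probability boost from the cache.
p_plain = base["speaker"]
p_cached = cache_lm_prob("speaker", base, history)
assert p_cached > p_plain
```

Words that appeared recently are promoted relative to the static base distribution, which is precisely the behavior that makes cache models effective for speaker- and document-level adaptation.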
Grave et al. [13] proposed an RNNLM with a continuous cache that adapts word prediction to the recent history by storing past hidden activations as memory. They showed effective reductions in language model perplexity with the cache model and smaller reductions in WER.

Interspeech 2018, 2-6 September 2018, Hyderabad. DOI: 10.21437/Interspeech.2018-2259