arXiv:2112.03871v1 [eess.AS] 7 Dec 2021

TRAINING END-TO-END SPEECH-TO-TEXT MODELS ON MOBILE PHONES

S. Zitha, Raghavendra Rao Suresh, Pooja Rao, T. V. Prabhakar
Department of Electronic Systems Engineering
Indian Institute of Science, Bengaluru, India, 560012.
Email: {zithas,poojarao,tvprabs}@iisc.ac.in, raghavendrasureshk10@gmail.com

ABSTRACT

Training state-of-the-art speech-to-text (STT) models on mobile devices is challenging because of their limited resources relative to a server environment. In addition, these models are trained on generic datasets that are not exhaustive in capturing user-specific characteristics. Recently, on-device personalization techniques have made strides in mitigating this problem. Although many current works have explored the effectiveness of on-device personalization, the majority of their findings are limited to simulation settings or a specific smartphone. In this paper, we develop, and explain in detail, a framework to train end-to-end models on mobile phones. For simplicity, we consider a model based on the connectionist temporal classification (CTC) loss. We evaluate the framework on mobile phones from several brands and report the results. We provide evidence that fine-tuning the models and choosing the right hyperparameter values is a trade-off between the lowest achievable WER, on-device training time, and memory consumption, and that managing this trade-off is vital for successfully deploying on-device training in a resource-limited environment such as a mobile phone. Using training sets from speakers with different accents, we record a 7.6% decrease in average word error rate (WER). We also report the associated real-time computational cost on mobile phones in terms of time, memory usage, and CPU utilization.

Index Terms— on-device training, personalization, speech recognition, model adaptation, on-device learning
1. INTRODUCTION AND MOTIVATION

Supervised training of end-to-end speech-to-text (STT), or automatic speech recognition (ASR), models requires a significantly large amount of annotated and transcribed audio data. Commercial STT systems [1] are mostly deployed on a centralised cloud with supporting network infrastructure. With the ubiquity of mobile phones and other embedded devices, there is a demand for reliable, fast, closed-loop edge-intelligence decoding that does not depend on network infrastructure. Recently, distributed training paradigms have emerged for STT models that use the rich data collected directly on mobile and other embedded devices to further train pre-trained models. Such methods not only reduce privacy risks but also improve response time by fine-tuning the model parameters on-device. However, such systems are constrained by memory and storage capacity, decoding speed, the availability and reliability of on-device training data, and inexpensive sensor hardware. Building end-to-end STT models for such lightweight systems is therefore a challenge in itself.

STT personalization [2] using acoustic model adaptation aims at fine-tuning the learnable weights of the model, in a resource-limited environment such as a smartphone, to capture the user's voice characteristics such as pitch, accent, and speaking rate. In [3], the authors used a sliding-window model to simulate data consumption as in a mobile environment. The pre-trained recurrent neural network transducer (RNN-T) [4, 5, 6, 7] model weights from the server are quantized to 8-bit integers for deployment. They suggest an approach to minimize memory consumption by splitting the training graph into smaller sub-graphs such that gradients are calculated separately for each sub-graph. [8] proposes a personalization approach for STT models that mitigates the performance degradation observed on disordered speech.
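To make the 8-bit deployment step above concrete, the following is a minimal sketch of symmetric per-tensor int8 quantization of pre-trained float weights, assuming a single shared scale factor; the function names are illustrative, not from the cited works.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization of float32 weights to int8.

    Returns the int8 tensor and the scale needed to recover
    approximate float values (w ~ q * scale).
    """
    scale = float(np.max(np.abs(weights))) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for computation."""
    return q.astype(np.float32) * scale

# Toy example: quantize a small weight matrix and bound the error.
w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
max_err = float(np.max(np.abs(w - w_hat)))
assert max_err <= scale / 2 + 1e-6  # rounding error is at most half a step
```

The storage saving is the point: each weight occupies one byte instead of four, at the cost of a bounded rounding error per parameter.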
Both of these studies report the results of a small user study of the personalization strategy on a Pixel 3 phone. However, the manner in which on-device training was carried out using the quantized model deployed on the phone is not elaborated anywhere. In [9], the authors investigate the effectiveness of fine-tuning the decoder of the RNN-T model to better recognize named entities. To prevent the personalized model from being indiscriminately accepted, a set of acceptance criteria based on loss and WER on validation data was implemented by [10], but neither work reports the performance of on-device STT training. The works discussed above quantize and dequantize the model weights between training rounds. Unlike these approaches, in our work we keep the model weights in quantized form throughout training, which is desirable for storage efficiency. Many such approaches focus on reporting acoustic model adaptation in simulation settings, but the deployment constraints
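The acceptance-criteria idea attributed to [10] above can be sketched as follows: accept the fine-tuned weights only if neither validation loss nor validation WER degrades relative to the baseline. This is a hypothetical illustration, not the implementation from [10]; the helper names and thresholds are assumptions.

```python
def wer(ref: list, hyp: list) -> float:
    """Word error rate: word-level edit distance divided by len(ref)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def accept_personalized(base_loss, new_loss, base_wer, new_wer):
    """Accept the fine-tuned model only if neither metric degrades."""
    return new_loss <= base_loss and new_wer <= base_wer

# Toy usage with made-up validation numbers.
ref = "the quick brown fox".split()
hyp = "the quick brown box".split()
assert abs(wer(ref, hyp) - 0.25) < 1e-9  # one substitution in four words
assert accept_personalized(1.20, 1.05, 0.30, 0.25)
assert not accept_personalized(1.20, 1.30, 0.30, 0.25)
```

A check of this form guards personalization against catastrophic forgetting: a round of fine-tuning that hurts held-out performance is simply discarded.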