arXiv:1909.06522v1 [eess.AS] 14 Sep 2019

MULTILINGUAL ASR WITH MASSIVE DATA AUGMENTATION

Chunxi Liu, Qiaochu Zhang, Xiaohui Zhang, Kritika Singh, Yatharth Saraf, Geoffrey Zweig

Facebook Inc., New York, NY, and Menlo Park, CA, USA
{chunxiliu,frankz,xiaohuizhang,skritika,ysaraf,gzweig}@fb.com

ABSTRACT

Towards developing high-performing ASR for low-resource languages, two approaches to addressing the lack of resources are to make use of data from multiple languages, and to augment the training data by creating acoustic variations. In this work we present a single grapheme-based ASR model learned on 7 geographically proximal languages, using standard hybrid BLSTM-HMM acoustic models with the lattice-free MMI objective. We build the single ASR grapheme set by taking the union of the language-specific grapheme sets, and we find that such a multilingual ASR model can perform language-independent recognition on all 7 languages and substantially outperform each monolingual ASR model. Secondly, we evaluate the efficacy of multiple data augmentation alternatives within each language, as well as their complementarity with multilingual modeling. Overall, we show that the proposed multilingual ASR with various data augmentation can not only recognize any language in the training set, but also provide large ASR performance improvements.

Index Terms – Multilingual acoustic modeling, data augmentation

1. INTRODUCTION

It can be challenging to build high-accuracy automatic speech recognition (ASR) systems in the real world, due to the vast language diversity and the requirement of extensive manual annotations on which ASR algorithms are typically built. A series of research efforts has thus far focused on guiding the ASR of a target language by using supervised data from multiple languages.

Consider the standard hidden Markov model (HMM) based ASR system with a phonemic lexicon, where the vocabulary is specified by a pronunciation lexicon.
One popular strategy is to make all languages share the same phonemic representations through a universal phonetic alphabet, such as the International Phonetic Alphabet (IPA) phone set [1, 2, 3, 4] or the X-SAMPA phone set [5, 6, 7, 8]. In this case, multilingual joint training can be applied directly. Given the effectiveness of neural network based acoustic modeling, another line of research is to share the hidden layers across multiple languages while keeping the softmax layers language dependent [9, 10]; such a multi-task learning procedure can improve ASR accuracies both for languages within the training set and for unseen languages after language-specific adaptation, i.e., cross-lingual transfer learning. Different nodes in the hidden layers have been shown to respond to distinct phonetic features [11], and hidden layers can potentially be transferable across languages. Note that the above works all assume the test language identity to be known at decoding time, with the language-specific lexicon and language model applied.

In the absence of a phonetic lexicon, building graphemic systems has shown performance comparable to phonetic lexicon-based approaches in extensive monolingual evaluations [12, 13, 14]. Recent advances in end-to-end ASR models have attempted to take the union of multiple language-specific grapheme (i.e., orthographic character) sets, and use this union as a universal grapheme set for a single sequence-to-sequence ASR model [15, 16, 17]. This allows for learning a grapheme-based model jointly on data from multiple languages, and performing ASR on the languages within the training set. In various cases it can produce performance gains over monolingual modeling that uses in-language data only.

The authors would like to thank Duc Le, Ching-Feng Yeh and Siddharth Shah, all with Facebook, for their invaluable infrastructure assistance and technical discussions. We also thank Yifei Ding and Daniel McKinnon, also at Facebook, for coordinating the ASR language expansion efforts.
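The union-based universal grapheme set described above can be sketched as follows. This is a minimal illustration, not the paper's actual tooling; the per-language inventories and the function name are hypothetical:

```python
# Sketch: build a universal grapheme set by taking the union of
# language-specific grapheme inventories (toy, hypothetical data).
def universal_grapheme_set(per_language_graphemes):
    """Return the sorted union of per-language grapheme sets,
    giving a stable symbol table shared by all languages."""
    universal = set()
    for graphemes in per_language_graphemes.values():
        universal |= set(graphemes)
    return sorted(universal)

# Toy inventories for three scripts with little overlap,
# mimicking geographically proximal languages with distinct alphabets:
inventories = {
    "lang_a": ["a", "b", "c"],   # Latin-script language
    "lang_b": ["α", "β"],        # Greek-script language
    "lang_c": ["a", "д"],        # shares "a" with lang_a
}
symbols = universal_grapheme_set(inventories)
```

Because the scripts barely overlap, decoding with this shared symbol table can emit the correct script for each test language even without an explicit language ID, as the paper later observes.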
In our work, we aim to examine the same approach of building a multilingual graphemic lexicon, while using a standard hybrid ASR system – based on Bidirectional Long Short-Term Memory (BLSTM) and HMM – learned with the lattice-free maximum mutual information (MMI) objective [18]. Our initial attempt is to build a single cascade – of an acoustic model, a phonetic decision tree, a graphemic lexicon and a language model – for 7 geographically proximal languages that have little overlap in their character sets. We evaluate it in a low-resource context where each language has around 160 hours of training data. We find that, despite the lack of explicit language identification (ID) guidance, our multilingual model can accurately produce ASR transcripts in the correct scripts of the test languages, and provides higher ASR accuracies than each language-specific ASR model. We further examine whether using a subset of closely related languages – by language family or orthography – can achieve the same performance improvements as using all 7 languages.

We proceed with our investigation of various data augmentation techniques to overcome the lack of training data in the above low-resource setting. Given the high scalability of neural network acoustic modeling, extensive alternatives for increasing the amount or diversity of existing training data have been explored in prior works, e.g., applying vocal tract length perturbation and speed perturbation [19], volume perturbation and normalization [20], additive noise [21], reverberation [20, 22, 23], and SpecAugment [24]. In this work we focus particularly on techniques that apply well to our video datasets collected in the wild.
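Of the augmentation techniques listed above, speed perturbation [19] is representative: the waveform is resampled to simulate faster or slower speech, multiplying the effective training data. A minimal sketch, assuming a 1-D NumPy waveform and using linear interpolation as a stand-in for a proper resampler (production pipelines typically use sox-based resampling):

```python
import numpy as np

def speed_perturb(waveform, factor):
    """Resample a 1-D waveform to simulate a speed change.

    factor > 1 speeds up speech (shorter output); factor < 1 slows
    it down (longer output). Linear interpolation is a simplification;
    it is not the paper's actual augmentation pipeline.
    """
    n_out = int(round(len(waveform) / factor))
    # Positions in the original signal to sample at the new length.
    src_positions = np.linspace(0, len(waveform) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(waveform)), waveform)

# One second of a 440 Hz tone at 16 kHz, perturbed both ways;
# 0.9x and 1.1x are the factors commonly used with this technique.
audio = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
fast = speed_perturb(audio, 1.1)   # ~10% shorter
slow = speed_perturb(audio, 0.9)   # ~11% longer
```

Applying several such factors to each utterance is one way the training set can grow to multiples of its original size, as examined in the experiments below.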
In comparing their individual and complementary effects, we aim to answer: (i) whether there is benefit in scaling model training to significantly larger data quantities, e.g., up to 9 times the original training set size, and (ii) if so, whether the efficacy of data augmentation is comparable or complementary to the above multilingual modeling.

Improving accessibility to videos "in the wild", such as automatic captioning on YouTube, has been studied in [25, 26]. While enabling applications like video captioning, indexing and retrieval, transcribing the heterogeneous Facebook videos of extensively diverse languages is highly challenging for ASR systems. On the whole, we present empirical studies in building a single multilingual ASR