Multilingual Neural Network Acoustic Modelling for ASR of Under-Resourced English-isiZulu Code-Switched Speech

Astik Biswas¹, Febe de Wet¹, Ewald van der Westhuizen¹, Emre Yılmaz²,³, Thomas Niesler¹

¹ Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa
² CLS/CLST, Radboud University, Nijmegen, Netherlands
³ Dept. of Electrical and Computer Engineering, National University of Singapore, Singapore

abiswas@sun.ac.za, fdw@sun.ac.za, ewaldvdw@sun.ac.za, e.yilmaz@let.ru.nl, trn@sun.ac.za

Abstract

Although isiZulu speakers code-switch with English as a matter of course, extremely little appropriate data is available for acoustic modelling. Recently, a small five-language corpus of code-switched South African soap opera speech was compiled. We used this corpus to evaluate the application of multilingual neural network acoustic modelling to English-isiZulu code-switched speech recognition. Our aim was to determine whether English-isiZulu speech recognition accuracy can be improved by incorporating three other language pairs in the corpus: English-isiXhosa, English-Setswana and English-Sesotho. Since isiXhosa, like isiZulu, belongs to the Nguni language family, while Setswana and Sesotho belong to the more distant Sotho family, we could also investigate the merits of additional data from within and across language groups. Our experiments using both fully connected DNN and TDNN-LSTM architectures show that English-isiZulu speech recognition accuracy as well as language identification after code-switching is improved more by the incorporation of English-isiXhosa data than by the incorporation of the other language pairs. However, additional data from the more distant language group remained beneficial, and the best overall performance was always achieved with a multilingual neural network trained on all four language pairs.
Index Terms: code-switching, under-resourced languages, African languages, speech recognition, DNN, TDNN-LSTM.

1. Introduction

With 11 official languages whose usage patterns overlap geographically, South Africa has a highly multilingual population. As a consequence, it is common to use more than one language during discourse. This phenomenon is known as code-switching (CS) and can occur between sentences, within the same sentence, and even within the same word [1, 2]. In South Africa, English is widespread and can be regarded as a common denominator among languages. It is, however, far from the most frequently used mother tongue. As a consequence, code-switching between English and the other languages permeates the daily conversations of South Africans. Automatic speech recognition (ASR) systems deployed in this environment should therefore be able to process multilingual speech that includes such code-switching.

Although most state-of-the-art ASR systems are monolingual, the automatic recognition of speech that includes code-switching has recently received increased attention [2–5]. In comparison with monolingual speech, code-switching in spontaneous speech is highly unpredictable and difficult to model. Despite recent advances in ASR achieved by the application of neural networks, success with code-switched speech has been limited by the particular challenges it presents for acoustic [1, 6] and language modelling [7]. These challenges are even more acute when the languages concerned are under-resourced [6, 8]. In South Africa, code-switching is prevalent between English, a highly resourced language, and the nine official African languages, which are all under-resourced.

Two main strategies to deal with code-switching in ASR have been described in the literature. The first incorporates language identification (LID) into the speech processing pipeline [9–11].
The LID component first labels speech frames, and monolingual ASR is subsequently used to perform decoding. This approach has the advantage of simplicity, since conventional acoustic and language modelling methods, which achieve excellent monolingual performance, can be employed. However, language identification is a difficult task, especially in the presence of intra-word or intra-sentential code-switching, and LID errors propagate to the monolingual recognisers, leading to poor ASR performance.

The second strategy is to perform single-pass ASR that does not depend on LID [2, 12]. This has the advantage of not requiring an explicit a priori LID and can therefore in principle avoid the errors necessarily associated with incorrect LID. It does, however, require new methods of language and acoustic modelling which explicitly model and allow the abrupt language changes at a code switch during the recognition pass. Training such acoustic and language models requires data, which is scarce for code-switched speech.

In this work, we investigate whether improved acoustic modelling can be achieved by the application of multilingual neural network training approaches to English-isiZulu code-switched speech. We build on a first study on these languages, which reported that language-dependent acoustic modelling outperformed language-independent acoustic modelling, but with very high word error rates (> 80%) [1]. Our aim was to determine whether the recognition performance for English-isiZulu code-switched speech could be improved by leveraging additional code-switched data by means of multilingual neural network architectures. Both deep neural networks (DNNs) and time-delay neural network long short-term memory (TDNN-LSTM) networks were considered for this purpose. Specifically, we considered whether code-switched data from other languages can be useful for acoustic modelling.
In addition, since some of the languages we consider are related, we evaluate the relative merits of multilingual modelling within and across these language families.

Interspeech 2018, 2-6 September 2018, Hyderabad, India. DOI: 10.21437/Interspeech.2018-1711

2. Corpus Details

A multilingual corpus containing examples of code-switched speech has been compiled from 626 South African soap opera