Enhancing Large Vocabulary Continuous Speech Recognition System for Urdu-English Conversational Code-Switched Speech

Muhammad Umar Farooq, Farah Adeeba, Sarmad Hussain, Sahar Rauf, Maryam Khalid
Center for Language Engineering, Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore.
{umar.farooq,farah.adeeba,sarmad.hussain,sahar.rauf,maryam.khalid}@kics.edu.pk

Abstract

This paper presents the first step towards a Large Vocabulary Continuous Speech Recognition (LVCSR) system for Urdu-English code-switched conversational speech. Urdu is the national language and lingua franca of Pakistan, with 100 million speakers worldwide. English, on the other hand, is the official language of Pakistan and is commonly mixed with Urdu in daily communication. Urdu, being an under-resourced language, has no substantial Urdu-English code-switched corpus available for developing a speech recognition system. In this research, a readily available spontaneous Urdu speech corpus (25 hours) is revised and used to enhance a read-speech Urdu LVCSR system so that it can recognize code-switched speech. This data set is split into a 20-hour train set and a 5-hour test set. 10 hours of Urdu Broadcast (BC) data are collected and annotated in a semi-supervised way to enhance the system further. For acoustic modeling, a state-of-the-art DNN-HMM modeling technique is used without any prior GMM-HMM training and alignments. Various techniques to improve the language model using monolingual data are investigated. The overall percent Word Error Rate (WER) is reduced from 40.71% to 26.95% on the test set.

Index Terms: Urdu-English code-switching, Urdu speech recognition, under-resourced language

1. Introduction

Code Switching (CS), the spontaneous use of two or more languages in a single conversation, is a prevalent linguistic phenomenon in multi-cultural societies and in countries where the native and official languages differ.
The dominant language is usually referred to as the matrix language and the secondary language is termed the embedded language. CS renders a monolingual Natural Language Processing (NLP) system clueless about the language and muddles the context when the system confronts the embedded language. CS is therefore very challenging for most monolingual NLP tasks such as Automatic Speech Recognition (ASR), Part of Speech (POS) tagging, Machine Translation (MT) and summarization. Since most off-the-shelf systems are monolingual, an increasing research interest is observed in developing CS speech recognition systems [1, 2, 3].

The most difficult challenge in developing models for a new language pair is the sparsity of annotated code-switched speech data, which makes both acoustic and language modeling a Gordian knot. The problem is exacerbated for low-resource languages, which have very little monolingual data to begin with. Urdu is the national language and lingua franca of Pakistan and is spoken by more than a hundred million speakers in Pakistan, India, Bangladesh and regions of Europe [4]. English, being the official language of Pakistan, is commonly mixed with the national language Urdu in daily communication. While English is resource-rich, Urdu is an under-resourced language with little available monolingual data. Various code-switched speech recognition systems, including English-Mandarin [5], Frisian-Dutch [6], English-Malay [7], French-Arabic [8] and Hindi-English [9], have been studied, but no such system exists for Urdu-English code-switched speech so far.

978-1-7281-9896-5/20/$31.00 ©2020 IEEE

Over the years, limited efforts have been made to develop resources and speech technologies for Urdu. A recent study [10] aimed to fill this gap, and an LVCSR system was developed for the Urdu language. A neural network was trained on 300 hours of read-speech Urdu data (from 1586 speakers), which yielded a WER of 13.5% on the test set.
The test set was 9 hours of unseen speech data (from 62 speakers). Earlier studies were restricted to a limited vocabulary [11], isolated words [12] or a small set of speakers [13].

Sarfarz et al. [14] designed and developed an Urdu speech corpus of 44.5 hours, of which 25 hours consisted of conversational speech. It was based on interview speech from various speakers, which hinted that it incorporated Urdu-English code-switching naturally. Rather than adopting the dilatory process of designing and collecting CS speech data, the aforementioned corpus was acquired to train the system. However, the English words were force-transliterated into Urdu during the annotation of the speech data, so the corpus is reworked in this research to make the text corpus code-switched. Furthermore, Urdu BC news data is collected from online audio/video sources and annotated in a semi-supervised way. Most of the data is fetched from YouTube and radio shows covering the entertainment, political and current affairs domains. 10 hours of Urdu spontaneous CS speech is collected and added to train the acoustic model.

For acoustic model training, efforts are being made to replace the widely used DNN-HMMs [1, 15] with end-to-end training [16, 17, 18, 19, 20]. The aim of such research is to expedite ASR building by avoiding the manual development of large lexicons and language-model corpora. However, the performance of end-to-end models is still behind that of DNN-HMMs [16, 17, 18]. Most neural networks in DNN-HMM systems are trained using alignments and context-dependency trees from GMM-HMM training [21]. In contrast, a novel acoustic modeling strategy [22] is used here, which trains the network in a flat-start manner and does not rely on any prior information.

In this paper, the read-speech Urdu LVCSR system is enhanced to recognize code-switched and spontaneous speech. The Urdu spontaneous speech corpus is reworked to make it usable for
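The WER figures quoted throughout (e.g. the reduction from 40.71% to 26.95%) follow the standard definition: the word-level Levenshtein distance between hypothesis and reference (substitutions + deletions + insertions), divided by the number of reference words. A minimal illustrative sketch, not part of the paper's toolkit:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N, computed as word-level
    Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, scoring the hypothesis "a x c" against the reference "a b c d" counts one substitution and one deletion over four reference words, giving a WER of 0.5.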