Enhancing Large Vocabulary Continuous Speech Recognition System for
Urdu-English Conversational Code-Switched Speech
Muhammad Umar Farooq, Farah Adeeba, Sarmad Hussain, Sahar Rauf, Maryam Khalid
Center for Language Engineering,
Al-Khawarizmi Institute of Computer Science,
University of Engineering and Technology, Lahore.
{umar.farooq,farah.adeeba,sarmad.hussain,sahar.rauf,maryam.khalid}@kics.edu.pk
Abstract
This paper presents the first step towards a Large Vocabulary
Continuous Speech Recognition (LVCSR) system for Urdu-English
code-switched conversational speech. Urdu is the national
language and lingua franca of Pakistan, with 100 million speakers
worldwide. English, on the other hand, is the official language of
Pakistan and is commonly mixed with Urdu in daily communication.
Urdu, being an under-resourced language, has no substantial
Urdu-English code-switched corpus available for developing a
speech recognition system. In this research, a readily available
spontaneous Urdu speech corpus (25 hours) is revised and used
to enhance a read-speech Urdu LVCSR system so that it can
recognize code-switched speech. This data set is split into a
20-hour training set and a 5-hour test set. A further 10 hours of
Urdu BroadCast (BC) data are collected and annotated in a
semi-supervised way to enhance the system. For acoustic modeling,
a state-of-the-art DNN-HMM technique is used without any prior
GMM-HMM training or alignments. Various techniques for improving
the language model using monolingual data are investigated. The
overall percent Word Error Rate (WER) is reduced from 40.71% to
26.95% on the test set.
Index Terms: Urdu-English code-switching, Urdu speech
recognition, under-resourced language
1. Introduction
Code Switching (CS), the spontaneous use of two or more
languages in a single conversation, is a prevalent linguistic
phenomenon in multi-cultural societies and in countries where the
native and official languages differ. The dominant language is
usually referred to as the matrix language and the secondary
language is termed the embedded language. CS renders a
monolingual Natural Language Processing (NLP) system clueless
about the language and muddles the context whenever the system
confronts the embedded language. CS is therefore very challenging
for most monolingual NLP tasks such as Automatic Speech
Recognition (ASR), Part of Speech (POS) tagging, Machine
Translation (MT) and summarization. Since most off-the-shelf
systems are monolingual, an increasing research interest is
observed in developing CS speech recognition systems [1, 2, 3].
The most difficult challenge in developing models for new
language pairs is the sparsity of annotated code-switched speech
data. It makes both acoustic and language modeling a Gordian
knot. The problem is exacerbated in the case of low-resource
languages, which have very little monolingual data to begin with.
Urdu is the national language and lingua franca of Pakistan and
is spoken by more than a hundred million speakers in Pakistan,
India, Bangladesh and parts of Europe [4]. English, being the
official language of Pakistan, is commonly mixed with the
national language Urdu in daily communication. While English is
resource-rich, Urdu is an under-resourced language with little
available monolingual data. Various code-switched speech
recognition systems, including English-Mandarin [5],
Frisian-Dutch [6], English-Malay [7], French-Arabic [8] and
Hindi-English [9], have been studied, but no such system for
English-Urdu code-switched speech exists so far.

978-1-7281-9896-5/20/$31.00 ©2020 IEEE
Over the years, limited efforts have been made to develop
resources and speech technologies for Urdu. A recent study [10]
focused on filling this gap, and an LVCSR system was developed
for the Urdu language. A neural network trained on 300 hours of
read Urdu speech (from 1,586 speakers) yielded a WER of 13.5% on
a test set of 9 hours of unseen speech data (from 62 speakers).
Earlier research was restricted to limited vocabularies [11],
isolated words [12] or small sets of speakers [13].
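The WER figures reported above and throughout this paper are the standard Levenshtein-based metric: the minimum number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the reference length. As a quick reference, a minimal sketch (not the paper's actual scoring tool):

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / len(ref),
    computed via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(h) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # match / substitution
    return dp[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution and one deletion against a four-word reference.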
Sarfarz et al. [14] designed and developed an Urdu speech
corpus of 44.5 hours, 25 hours of which consisted of
conversational speech. As it was based on interview speech from
various speakers, it naturally incorporated Urdu-English
code-switching. Rather than undertaking the slow process of
designing and collecting CS speech data, this corpus was acquired
to train the system. However, the English words had been forcibly
transliterated into Urdu during annotation of the speech data, so
the corpus is reworked in this research to make the text corpus
genuinely code-switched. Furthermore, Urdu BC news data is
collected from online audio/video sources and annotated in a
semi-supervised way. Most of the data is fetched from YouTube and
radio shows covering the entertainment, political and current
affairs domains. In total, 10 hours of Urdu spontaneous CS speech
is collected and added to train the acoustic model.
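A common semi-supervised annotation scheme, consistent with the description above, is to decode the untranscribed broadcast audio with a seed recognizer and route only low-confidence hypotheses to human annotators. The exact pipeline is not specified in this section, so the following is an illustrative sketch; the function name, tuple format and the 0.9 threshold are assumptions, not details from the paper:

```python
def split_by_confidence(decoded, threshold=0.9):
    """Partition automatically transcribed utterances: hypotheses with
    per-utterance confidence >= threshold are accepted as training labels,
    the rest are sent for manual review/correction.

    `decoded` is a list of (utterance_id, hypothesis, confidence) tuples,
    e.g. as produced by a seed recognizer (illustrative format)."""
    accepted, review = [], []
    for utt_id, hyp, conf in decoded:
        target = accepted if conf >= threshold else review
        target.append((utt_id, hyp))
    return accepted, review
```

Raising the threshold trades training-set size for label quality; the corrected low-confidence utterances can be folded back in after manual review.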
For acoustic model training, efforts are being made to replace
the widely used DNN-HMMs [1, 15] with end-to-end training
[16, 17, 18, 19, 20]. The aim of such research is to expedite ASR
development by avoiding the manual construction of large lexicons
and language-model corpora. However, the performance of
end-to-end models still lags behind that of DNN-HMMs
[16, 17, 18]. Most neural networks in DNN-HMM systems are trained
using alignments and context-dependency trees obtained from
GMM-HMM training [21]. Here, however, a novel acoustic modeling
strategy [22] is used that trains the network in a flat-start
manner and does not rely on any prior information.
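The core of any flat-start scheme is that, with no GMM-HMM alignments available, initial frame-to-phone correspondences must come from the data itself; the simplest such initialization divides each utterance's frames evenly across its transcript. The toy sketch below illustrates only this initialization idea, not the actual training strategy of [22]:

```python
def uniform_alignment(num_frames, phones):
    """Flat-start initialization: with no prior GMM-HMM alignments,
    split an utterance's frames evenly across its phone sequence.
    Returns (phone, start_frame, end_frame) spans, end exclusive."""
    base, extra = divmod(num_frames, len(phones))
    alignment, start = [], 0
    for i, ph in enumerate(phones):
        # earlier phones absorb the remainder frames, one each
        length = base + (1 if i < extra else 0)
        alignment.append((ph, start, start + length))
        start += length
    return alignment
```

For a 10-frame utterance with phones `["a", "b", "c"]` this yields spans of 4, 3 and 3 frames; in practice such a crude segmentation is only a starting point that the network refines during training.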
In this paper, a read-speech Urdu LVCSR system is enhanced to
recognize code-switched and spontaneous speech. The Urdu
spontaneous speech corpus is reworked to make it usable for