Improving the Lwazi ASR baseline

Charl van Heerden 1,2, Neil Kleynhans 1,2 and Marelie Davel 1,2

1 Multilingual Speech Technologies, North-West University, South Africa.
2 NWU-CAIR, CSIR Meraka, South Africa.
cvheerden@gmail.com

Abstract

We investigate the impact of recent advances in speech recognition techniques for under-resourced languages. Specifically, we review earlier results published on the Lwazi ASR corpus of South African languages, and experiment with additional acoustic modeling approaches. We demonstrate large gains by applying current state-of-the-art techniques, even if the data itself is neither extended nor improved. We analyze the various performance improvements observed, report on comparative performance per technique – across all eleven languages in the corpus – and discuss the implications of our findings for under-resourced languages in general.

Index Terms: speech recognition, Lwazi, Lwazi ASR corpus, phone recognition, South African languages.

1. Introduction

Automatic speech recognition (ASR) of under-resourced languages is a topic that has garnered increasing interest over the past decade [1]. Targeted data collection efforts such as Globalphone [2], Babel [3] and others [4, 5] have steadily increased the language coverage of available speech corpora. At the same time, freely available tools for data collection [6, 7, 8] have made small localized corpus development much easier, also contributing to the growing pool of curated ASR training data. Still, the majority of sub-Saharan African languages remain under-resourced, with limited or no speech resources available for the study of many of these languages.

In parallel with work targeted at dealing with issues specific to under-resourced languages, recent developments in mainstream ASR research have resulted in clear performance improvements.
Specifically, the application of deep neural networks [9], sub-space Gaussian modeling [10] and the packaging of many of these techniques within the Kaldi toolkit [11] have contributed to improved performance in well-resourced ASR systems.

In this study, we revisit earlier baselines obtained on the Lwazi ASR corpus [12], a small freely available corpus of telephony speech in the eleven official languages of South Africa, and determine how these baselines are affected by recent developments. We consider performance trends across a range of languages, in order to better understand the implications for smaller ASR corpora in general.

2. Background

As background to this work, we provide an overview of the Lwazi corpus (Section 2.1), discuss earlier baselines obtained on this corpus (Section 2.2), and touch on those recent developments in ASR that we focus on in this study (Section 2.3).

2.1. The Lwazi corpus

The Lwazi project [13] was originally conceptualized to demonstrate the potential of speech technologies in providing access to information [14]. At the end of the first phase (2006–2009), basic speech recognition and text-to-speech systems were developed in all eleven of South Africa’s official languages. (For the majority of these languages, this was the first time such technologies were developed.) In addition, resources developed included annotated speech corpora [12] and electronic pronunciation dictionaries [15], all of which were made available freely via the South African Resource Management Agency [16].

The languages included in the corpus are listed in Table 1, with nine of the eleven languages from the Southern Bantu (SB) family. Per language, the ISO 639-3:2007 language code, language family and estimated number of first language speakers in South Africa are shown. The majority of Southern Bantu languages are from two language families – Nguni and Sotho-Tswana – with Tshivenda and Xitsonga from two additional language families.
Table 1: Languages in the Lwazi ASR corpus [17, 18].

Language     ISO code   # million speakers   Language family
isiZulu      zul        11.6                 SB:Nguni
isiXhosa     xho         8.2                 SB:Nguni
Afrikaans    afr         6.9                 Germanic
English      eng         4.9                 Germanic
Sepedi       nso         4.6                 SB:Sotho-Tswana
Setswana     tsn         4.0                 SB:Sotho-Tswana
Sesotho      sot         3.8                 SB:Sotho-Tswana
Xitsonga     tso         2.3                 SB:Tswa-Ronga
siSwati      ssw         1.3                 SB:Nguni
Tshivenda    ven         1.2                 SB:Venda
isiNdebele   nbl         1.1                 SB:Nguni

The Lwazi 1 corpus was the first set of resources available in all of South Africa’s official languages, but is very small in today’s terms, consisting of between 4 and 10 hours of speech per language (see Table 2).

2.2. Earlier baselines

Various earlier results have been published with regard to the Lwazi corpus [12, 19, 20, 21, 22], not all directly comparable to the work here. The first recognition results on the Lwazi corpus [12] utilized an earlier version of the corpus, and two follow-up papers experimented with concept recognition [19] and data pooling [20], respectively. Given the construction of the corpus (prompts selected from a limited set of government documents) word-based recognition is heavily biased towards