Data Augmentation, Feature Combination, and Multilingual Neural Networks to Improve ASR and KWS Performance for Low-resource Languages

Zoltán Tüske 1, Pavel Golik 1, David Nolden 1, Ralf Schlüter 1, Hermann Ney 1,2

1 Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52056 Aachen, Germany
2 Spoken Language Processing Group, LIMSI CNRS, Paris, France
{tuske, golik, nolden, schlueter, ney}@cs.rwth-aachen.de

Abstract

This paper presents the progress of acoustic models for low-resource languages (Assamese, Bengali, Haitian Creole, Lao, Zulu) developed within the second evaluation campaign of the IARPA Babel project. This year, the main focus of the project is on training high-performing automatic speech recognition (ASR) and keyword search (KWS) systems from language resources limited to about 10 hours of transcribed speech data. Optimizing the structure of Multilayer Perceptron (MLP) based feature extraction and switching from the sigmoid activation function to rectified linear units result in about 5% relative improvement over baseline MLP features. Further improvements are obtained when the MLPs are trained on multiple feature streams and when label-preserving data augmentation techniques such as vocal tract length perturbation are exploited. Systematic application of these methods improves the unilingual systems by 4-6% absolute in WER and 0.064-0.105 absolute in MTWV. Transfer and adaptation of multilingually trained MLPs lead to additional gains, clearly exceeding the project goal of 0.3 MTWV even when only the limited language pack of the target language is used.

Index Terms: ASR, KWS, MTWV, MLP, rectified linear units, multilingual, low-resource

1. Introduction

Speech technologies are applied to a growing number of languages.
Thus, there is large interest in methods that ease the training and improve the quality of the models, especially in the first steps of the development phase, where only a limited amount of data is available. The Babel project funded by IARPA addresses these goals through the development of robust speech technologies, focusing on spoken term detection, that can be applied to any language with a limited amount of transcription in a limited time [1]. As the project progresses, the main focus has shifted: this year, the participants have to reach the target keyword search (KWS) performance using limited transcriptions (about 10 hours of speech) on the following languages: Assamese, Bengali, Haitian Creole, Lao, and Zulu.

As has already been shown, neural networks (NN) play a key role in achieving the project goals [2, 3], either through the tandem [4] or the hybrid acoustic modeling approach [5]. Applying multilingual training, e.g. of [6], to deep Multilayer Perceptrons (MLP), [7, 8, 9, 10] demonstrated that borrowing orders of magnitude more data from other languages improves the ASR and KWS performance enormously if only a limited amount of data is available in the target language.

In previous years, novel non-linearities that are biologically more plausible than the sigmoid have been proposed for NNs. For instance, maxout and Rectified Linear Units (ReLU) have been successfully applied to machine learning problems [11, 12]. In [13], the authors showed significant improvement using ReLU on a Large Vocabulary Continuous Speech Recognition (LVCSR) task. Besides the recently introduced activation units, label-preserving data augmentation techniques, widely used in image recognition tasks, also show consistent improvements for low-resource speech recognition [14, 15].
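To illustrate the difference between the two non-linearities discussed above, the following is a minimal NumPy sketch; the layer sizes, weights, and input frame are hypothetical placeholders, not the configuration used in this work:

```python
import numpy as np

def sigmoid(x):
    # Saturating non-linearity: outputs in (0, 1), gradients vanish for large |x|.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: identity for positive inputs, zero otherwise.
    return np.maximum(0.0, x)

# Hypothetical hidden layer: 40-dim acoustic feature frame, 512 units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(40, 512))
b = np.zeros(512)
x = rng.normal(size=40)  # one feature frame

h_sig = sigmoid(x @ W + b)   # dense: every value strictly between 0 and 1
h_relu = relu(x @ W + b)     # sparse: units with negative input are exactly zero
```

The non-saturating, sparse ReLU output is one intuition for the faster training and the gains reported in [13].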
Furthermore, exploiting multiple representations of the speech signal through neural network based feature combination resulted in considerable improvements on a Spanish broadcast news and conversation LVCSR task [16].

Therefore, in this paper we investigate the application of data perturbation, feature combination, and ReLU activation units to improve low-resource ASR and KWS systems for a diverse set of languages. The experiments are carried out with the tandem approach. In line with the primary goal of the Babel project we concentrate on unilingual systems; however, the best MLP architectures and techniques are also tested in multilingual setups.

The paper is organized as follows: Section 2 gives a short corpus description and an overview of the keyword search task of the Babel Program. Section 3 summarizes the investigated methods and details our experimental setups. The ASR and KWS results are presented in Section 4. The paper closes with conclusions in Section 5.

2. Task description

One of the main goals of the IARPA Babel Program is to reduce the performance gap of speech applications between high-resource, well-studied languages (like English) and low-resource languages that have not yet been researched extensively. The participants compete in keyword search evaluations. The performance is measured by the Actual Term Weighted Value (ATWV), based on the average value lost per term [17]. The loss is a weighted linear combination of the probabilities of miss and false alarm errors at the actual detection threshold. The threshold is optimized on a development corpus with a development keyword set by maximizing the term weighted value (MTWV). The performers should achieve a minimum of 0.3 ATWV on the evaluation set with the evaluation keyword set.
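The term weighted value underlying ATWV and MTWV can be sketched as follows. This is a simplified illustration: the function name, the example counts, and the approximation of non-target trials as one per second of speech are our own, and the default beta = 999.9 follows from the standard NIST cost settings (cost/value ratio 0.1, keyword prior 1e-4):

```python
def twv(stats, t_speech, beta=999.9):
    """Term weighted value for a keyword set at a fixed detection threshold.

    stats:    list of (n_true, n_correct, n_false_alarm) tuples, one per keyword
    t_speech: total amount of speech in seconds
    beta:     weight trading false alarms off against misses
    """
    loss = 0.0
    for n_true, n_correct, n_fa in stats:
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials approximated as one per second of speech.
        p_fa = n_fa / (t_speech - n_true)
        loss += p_miss + beta * p_fa
    return 1.0 - loss / len(stats)

# Perfect detection yields TWV = 1.0; missing every occurrence yields 0.0;
# because of the large beta, even a few false alarms are penalized heavily.
print(twv([(10, 10, 0), (5, 5, 0)], t_speech=36000.0))  # -> 1.0
```

ATWV evaluates this quantity at the system's actual decision threshold, while MTWV reports its maximum over all thresholds.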
For each language more than 100 hours of data are collected (the full language pack, FLP); however, a considerable portion is non-speech, and in the current period about 75% of the corpus is transcribed. The limited language pack (LLP) comprises only about 10 hours of speech. The ASR and KWS tasks are challenging.

Copyright 2014 ISCA, INTERSPEECH 2014, 14-18 September 2014, Singapore