Two-Stage System for Robust Neutral/Lombard Speech Recognition

Hynek Bořil 1, Petr Fousek 1, Harald Höge 2
1 Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
2 Siemens Corporate Technology, Munich, Germany
{borilh, p.fousek}@gmail.com, harald.hoege@siemens.com

Abstract

The performance of current speech recognition systems deteriorates significantly in strongly noisy environments. This can be attributed to background noise and to the Lombard effect (LE). Attempts at LE-robust systems often display a trade-off between LE-specific improvements and portability to neutral speech. Towards LE-robust recognition, it therefore seems effective to use a set of condition-dedicated subsystems driven by a condition classifier, rather than to aim for one universal recognizer. This paper focuses on the design of a two-stage recognition system (TSR) comprising a talking-style classifier (neutral/LE) followed by two style-dedicated recognizers differing in their input features. First, the binary neutral/LE classifier is built, with particular attention to developing suitable features for the classification. Second, the performance of common speech features (MFCC, PLP), LE-robust features (Expolog) and newly proposed features is compared in neutral/LE digit recognition tasks. In addition, robustness to changes in average speech pitch and to various noise backgrounds is evaluated. Third, the TSR is built, employing two recognizers, each using style-specific features. Compared with either a neutral-specific or an LE-specific recognizer on joint neutral/LE speech, the proposed system reduces WER from 6.5% to 4.2% on neutral and from 48.1% to 28.4% on LE Czech utterances.

Index Terms: Lombard effect, talking style classification, robust features, speech recognition

1. Introduction

The Lombard effect refers to changes in speech production introduced by a speaker in an effort to maintain intelligible communication [1, 2].
A number of works have studied the impact of noise on speech production. Some analyzed acoustic-phonetic variations at a few discrete levels of background noise [1, 2, 4, 9, 10], while others searched for a continuous dependency on the noise level [11, 12]. Significant differences between LE and neutral speech have been reported in the distributions of vocal intensity, fundamental frequency, glottal pulse shape and spectral tilt, locations and bandwidths of the first formants, and other parameters [1], substantially impairing the accuracy of recognizers employing neutral speech models, e.g. [1–5]. Efforts to improve performance under LE include the design of robust features [3, 4], equalization methods [5] and style-dependent training of acoustic models [1]. However, condition-dependent training or robust feature design often results in a decrease of performance when the conditions change [1, 4]. This suggests addressing each of the conditions by a separate dedicated subsystem and implementing a switching mechanism, i.e., a condition classifier. A similar idea was successfully proposed in [6], where an artificial neural network (ANN) talking-style classifier was used to weight the outputs of a codebook of style-dependent HMM recognizers. In [7], style classification and speech recognition were performed simultaneously by an N-channel HMM, with one HMM dimension allocated to each speaking style; this approach allowed for style classification at the HMM-state level. In this paper, a two-stage approach is proposed, combining a style classifier with independent neutral and LE recognizers. In the first stage, utterances are classified by speaking style; in the second stage, they are passed to the corresponding dedicated recognizer.

The paper is organized as follows. First, a set of selected features is tested for discriminability in neutral/LE classification. Several possible setups are compared, the best of which yields the final classification feature vector (CFV).
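The two-stage idea above can be sketched as follows. This is a minimal illustration only: the classifier and recognizer callables are hypothetical stand-ins for the paper's ANN/GMM style classifier and the two style-dedicated recognizers, not actual implementations.

```python
# Minimal sketch of the two-stage recognition (TSR) idea: a binary
# neutral/LE style classifier routes each utterance to a dedicated
# recognizer. All components here are hypothetical placeholders.
from typing import Callable, Dict, List


def two_stage_recognize(
    utterance: List[float],
    classify_style: Callable[[List[float]], str],          # returns "neutral" or "LE"
    recognizers: Dict[str, Callable[[List[float]], str]],  # style -> dedicated recognizer
) -> str:
    """Stage 1: classify the talking style; stage 2: dispatch the
    utterance to the style-dedicated recognizer (each recognizer may
    use different front-end features)."""
    style = classify_style(utterance)
    return recognizers[style](utterance)


if __name__ == "__main__":
    # Toy usage with placeholder components.
    classifier = lambda u: "LE" if max(u) > 0.5 else "neutral"  # stand-in classifier
    recs = {
        "neutral": lambda u: "digits decoded with neutral-tuned features",
        "LE": lambda u: "digits decoded with LE-robust features",
    }
    print(two_stage_recognize([0.1, 0.8], classifier, recs))
    # prints: digits decoded with LE-robust features
```

The design choice this illustrates is separation of concerns: each recognizer can be trained and tuned for its own condition, while the classifier alone bears the cost of condition mismatch.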
Subsequently, the CFV is used for training ANN- and GMM-based classifiers. Second, common speech features (MFCC, PLP), special LE-robust features (Expolog [3]) and newly proposed front-end modifications are tested in a neutral/LE digit recognition task, sharing a common back-end architecture. Their robustness to changes in average utterance pitch and to emulated noisy backgrounds at various SNRs is compared. Finally, the TSR is designed, employing the style classifier and two recognizers, each using the best-performing features found.

All the presented experiments were carried out on the CLSD’05 database [8]. The database comprises recordings of Czech neutral speech and of Lombard speech uttered in simulated noisy conditions. In the latter case, car noise at 95 dB SPL was presented to the speakers through closed headphones, yielding a high SNR in the recorded speech.

2. Classifying neutral/LE speech

Based on previous studies, only features providing significant style discriminability at the phoneme/gender-independent level were selected for the neutral/LE classification: vocal intensity, spectral slope of the voiced speech segments, and the mean and standard deviation of the fundamental frequency. Several frequency bands for spectral-slope extraction are considered, as well as linear and semitone representations of the fundamental frequency. The variants with superior discriminability are included in the CFV used to train the ANN and GMM classifiers. Analyses of feature distributions and training of the classifiers were carried out on a development set comprising digits and phonetically rich sentences uttered by 8 female and 7 male speakers. The open test set comprised digits and sentences uttered by 4 male and 4 female speakers (disjoint from the development speakers).

2.1. Features for neutral/LE classification

Voice intensity, spectral slope and fundamental frequency (F0) are extracted and averaged within each utterance, i.e., each utterance is parameterized by a single mean feature vector.
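Per-utterance extraction of such features can be sketched as below. This is a rough illustration assuming NumPy; frame-level F0 values and voicing decisions are assumed to come from an external pitch tracker, and the band limits and F0 reference are illustrative placeholders, not the paper's actual choices.

```python
# Sketch of a per-utterance classification feature vector (CFV):
# mean vocal intensity, spectral slope over voiced frames, and F0
# mean and standard deviation on a semitone scale. Illustrative only.
import numpy as np


def utterance_cfv(frames, f0_hz, fs=16000.0, band=(60.0, 3000.0), f0_ref=100.0):
    """frames: (n_frames, frame_len) array of voiced speech frames;
    f0_hz: per-frame F0 estimates in Hz (voiced frames only)."""
    # 1) Vocal intensity: mean log-energy over the frames.
    energy = np.sum(frames ** 2, axis=1) + 1e-12
    intensity = float(np.mean(10.0 * np.log10(energy)))

    # 2) Spectral slope: slope of a line fitted to the mean
    #    log-magnitude spectrum within the chosen band (dB per Hz).
    spec = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    log_spec = 20.0 * np.log10(np.mean(spec[:, sel], axis=0) + 1e-12)
    slope = float(np.polyfit(freqs[sel], log_spec, 1)[0])

    # 3) F0 statistics, here on the semitone scale relative to f0_ref.
    st = 12.0 * np.log2(np.asarray(f0_hz) / f0_ref)
    return np.array([intensity, slope, float(np.mean(st)), float(np.std(st))])
```

Each utterance thus yields one compact vector, which is what makes utterance-level style classification with small models feasible.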
For F0, its standard deviation is also included in the CFV.

INTERSPEECH 2007, August 27-31, Antwerp, Belgium
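The GMM branch of the classifier can be illustrated in its simplest form: one diagonal-covariance Gaussian (a one-component GMM) per style is fitted to development CFVs, and a test utterance receives the style whose model gives the higher log-likelihood. This is a pure-NumPy sketch under those simplifying assumptions, not the paper's actual configuration.

```python
# Degenerate GMM style classifier: a single diagonal-covariance
# Gaussian per style; classification picks the higher log-likelihood.
import numpy as np


def fit_diag_gaussian(cfvs):
    """Fit mean and (floored) per-dimension variance to a set of CFVs."""
    cfvs = np.asarray(cfvs, dtype=float)
    return cfvs.mean(axis=0), cfvs.var(axis=0) + 1e-6  # variance floor


def log_likelihood(params, cfv):
    """Diagonal-Gaussian log-likelihood of one CFV."""
    mean, var = params
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var)
                               + (cfv - mean) ** 2 / var))


def classify_style(models, cfv):
    """models: dict style -> (mean, var); returns the most likely style."""
    cfv = np.asarray(cfv, dtype=float)
    return max(models, key=lambda s: log_likelihood(models[s], cfv))
```

A production system would use mixtures with several components per style, but the decision rule, maximum likelihood over style-specific density models, stays the same.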