Methodology of Lombard Speech Database Acquisition: Experiences with CLSD

Hynek Bořil, Tomáš Bořil, Petr Pollák

Faculty of Electrical Engineering
Czech Technical University in Prague, Czech Republic
borilh@gmail.com, borilt@gmail.com, pollak@fel.cvut.cz

Abstract

In this paper, the acquisition process of the Czech Lombard Speech Database (CLSD’05) is presented. Feature analyses have shown a strong presence of the Lombard effect in the database. In a small-vocabulary recognition task, significant performance degradation was observed for the Lombard speech recorded in the database. The aim of this paper is to describe the hardware platform, scenarios and recording tool used for the acquisition of CLSD’05. During the database recording and processing, several difficulties were encountered. The most important question was how to adjust the level of speech feedback for the speaker. A method for minimizing the speech attenuation introduced to the speaker by headphones is proposed in this paper. Finally, the contents and corpus of the database are presented to outline its suitability for analysis and modeling of the Lombard effect. The whole CLSD’05 database with detailed documentation is now released for public use.

1. Introduction

Great effort is being made to increase the robustness of automatic speech recognition systems in order to allow building voice-controlled devices that operate reliably in adverse environments. In noisy conditions, recognition is degraded not only by the presence of the disturbing background but also by the Lombard effect (LE), i.e., the speech production changes a speaker introduces in an effort to increase communication intelligibility.
Speech databases acquired in real conditions provide valuable material for recognition systems, but with louder backgrounds (crowded places, moving cars, airplane cockpits) it may be difficult to analyze the impact of speech feature variations caused by LE separately from the impact of the noise present in the recordings. Ensuring similar recording conditions and appropriate speaker reactions to the actual noise may also be an issue in real conditions. During recording, speakers may tend to concentrate just on the correct pronunciation of the text without reacting adequately to the actual conditions. Databases focused on LE usually present a simulated noisy background to the speaker through headphones; hence a high SNR of the recorded speech is preserved and the recording conditions can be easily controlled (Hansen, 1996; Chi & Oh, 1996; Wakao et al., 1996).

In (Bořil & Pollák, 2005, 1), basic properties of the CLSD’05 database were introduced. In (Bořil & Pollák, 2005, 2), overall Lombard speech features of CLSD’05 were analyzed and compared to two large Czech databases containing recordings from a moving car. In CLSD’05, LE was found to be significantly stronger than in the other two databases. In this paper, the recording platform and contents of CLSD’05 are presented and extensions of the setup are proposed.

2. CLSD’05 recording platform

To enable observation of the influence of LE on speech features at the speaker level, each speaker was recorded in both a neutral and a simulated noisy scenario. Our experience from recordings in natural environments shows that speakers often tend to ignore actual environmental changes and concentrate just on reading the prompts correctly. Such reading does not resemble real communication, and thus, from the speaker’s viewpoint, there is little need to preserve the intelligibility of the speech. To avoid this, it seems reasonable to introduce a communication element into the recording process.

2.1. Recording setup

In the simulated noisy conditions, noise samples mixed with the speech feedback are reproduced to the speaker through closed headphones. An operator assesses utterance intelligibility while hearing the same noise mixed with the speaker’s voice, whose intensity is decreased according to the selected virtual distance, see Fig. 1. If the speech cannot be understood well, the operator asks the speaker to repeat the item. It was observed that after several such requests, speakers started to react to the actual noise appropriately. In the neutral scenario, the speaker does not wear headphones while reading the prompts. In both scenarios, the speech is picked up by two microphones placed at different distances. The recording set consists of two AKG K44 closed headphones, a Sennheiser ME-104 close-talk microphone and a Nokia NB2 hands-free microphone. These microphones were chosen to match the Czech SPEECON recording conditions (SPEECON, 2001).

Figure 1: Recording setup (diagram labels: SPEAKER with close talk and middle talk microphones, receiving noise + speech feedback; H&T RECORDER; OPERATOR with noise + speech monitor, deciding OK - next / BAD - again)

2.2. SPL adjustment

At the beginning of each Lombard recording session, it was necessary to adjust the level of the reproduced background noise. For this purpose, a transfer function between the sound card open-circuit effective voltage V_RMS_OL and the SPL in the headphones was determined by measurement on a dummy head, see Fig. 2. For the required noise level, the corresponding V_RMS_OL was then set. A constant level of 90 dB SPL and a virtual distance of 1-3 meters were chosen for the