The LIMSI 1995 Hub3 System

J.L. Gauvain, L. Lamel, G. Adda, D. Matrouf
LIMSI-CNRS, BP 133, 91403 Orsay cedex, FRANCE
{gauvain,lamel,gadda,driss}@limsi.fr

ABSTRACT

In this paper we report on the LIMSI recognizer evaluated in the ARPA 1995 North American Business (NAB) News Hub 3 benchmark test. The LIMSI recognizer is an HMM-based system with Gaussian mixture densities. Decoding is carried out in multiple forward acoustic passes, where more refined acoustic and language models are used in successive passes and information is transmitted via word graphs. In order to deal with the varied acoustic conditions, channel compensation is performed iteratively, refining the noise estimates before each of the first three decoding passes. The final decoding pass is carried out with speaker-adapted models obtained via unsupervised adaptation using the MLLR method. In contrast to previous evaluations, the new Hub 3 test aimed at improving basic speaker-independent CSR performance on unlimited-vocabulary read speech recorded under more varied acoustical conditions (background environmental noise and unknown microphones). On the Sennheiser microphone data (average SNR 29 dB) a word error rate of 9.1% was obtained, compared to 17.5% on the secondary microphone data (average SNR 15 dB) using the same recognition system.

INTRODUCTION

In this paper we report on the LIMSI speech recognizer used in the ARPA November 1995 evaluation on the North American Business (NAB) News task [13]. LIMSI has participated in the annual ARPA-sponsored continuous speech recognition evaluations aimed at improving basic speech recognition technology since November 1992. The goal of the 1995 Hub 3 task was to "improve basic speaker-independent performance on unlimited-vocabulary read speech under acoustical conditions that are somewhat more varied and degraded than speech used in previous ARPA evaluations".
Besides the problems posed by the unlimited-vocabulary dictation task on reasonably clean speech data (such as the WSJ0/WSJ1 corpus), one of the major challenges of the Nov95 evaluation was to achieve acceptable performance on other (i.e., non-close-talking) microphone data with no prior knowledge of either the microphone type or the background noise characteristics. In the next section we provide an overview of the LIMSI speech recognition system and the decoding strategy. We then describe our development work in language modeling, including the text processing and vocabulary selection. The recognition lexicon is presented along with a description of our semi-automatic method for adding pronunciations for new words. We then describe the experiments carried out with acoustic modeling and environmental compensation aimed at improving performance on the noisy data. In contrast to previous evaluations, where for the primary system each sentence was treated independently (i.e., the results had to be independent of the order in which the test sentences were processed), this year we used knowledge of the article boundaries and utterance order to carry out unsupervised transcription-mode adaptation.

RECOGNIZER OVERVIEW

The LIMSI speech recognizer makes use of continuous density HMMs with Gaussian mixture densities for acoustic modeling and n-gram statistics estimated on newspaper texts for language modeling. The recognition vocabulary contains 65k words selected to minimize the out-of-vocabulary rate on a set-aside portion of the development text set. Bigram and trigram language models were trained on 284M words of text and read WSJ0/1 speech transcriptions predating July 30, 1995 (inclusive). Context-dependent phone models were trained on the Sennheiser channel of 46k sentences taken from the WSJ0/1 corpus. Decoding is carried out in multiple passes, with more accurate models used in successive passes. All passes use cross-word CD phone models.
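The vocabulary-selection criterion just described, choosing the word list that minimizes the out-of-vocabulary (OOV) rate on held-out development text, can be illustrated with a small sketch. This is not LIMSI's actual selection procedure, only the standard frequency-based approximation to it; all data below are toy placeholders.

```python
from collections import Counter


def select_vocab(train_tokens, size):
    """Keep the `size` most frequent training words -- the usual
    frequency-based approximation to minimizing the OOV rate."""
    counts = Counter(train_tokens)
    return {word for word, _ in counts.most_common(size)}


def oov_rate(vocab, dev_tokens):
    """Fraction of development-set tokens not covered by the vocabulary."""
    oov = sum(1 for word in dev_tokens if word not in vocab)
    return oov / len(dev_tokens)


# Toy illustration (stand-in for the 284M-word training text and
# the set-aside development text mentioned above).
train = "the market rose the dollar fell the market closed".split()
dev = "the market closed higher".split()

vocab = select_vocab(train, size=3)
print(sorted(vocab))          # 3 most frequent training words
print(oov_rate(vocab, dev))   # share of dev tokens outside the vocabulary
```

In practice the candidate vocabulary size (here 65k) trades off OOV rate against acoustic and language model complexity, which is why the rate is measured on set-aside text rather than on the training data itself.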
Revised noise estimates are made between decoding passes, and unsupervised speaker adaptation is carried out in the final pass.

Acoustic models

Acoustic modeling uses 48 cepstral parameters derived from a Mel frequency spectrum estimated on the 0-8 kHz band every 10 ms (30 ms window). Cepstral mean removal was performed for each sentence. The models were trained on 46,146 sentences (about 99 hours of speech) from 355 speakers of the WSJ0/1 corpus. This comprises 37,518 sentences from the WSJ0/1 SI-284 corpus, 130 sentences per speaker from 57 long-term and journalist speakers in WSJ0/1, and 1218 sentences from 14 of the 17 additional WSJ0 speakers not included in SI-84. Only the data from the close-talking Sennheiser HMD-410 microphone was used for training.
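The per-sentence cepstral mean removal above can be sketched as follows. This is a minimal NumPy illustration, not LIMSI's front end; the frame count and the random feature values are placeholders, with only the 48-coefficient dimensionality taken from the text.

```python
import numpy as np


def cepstral_mean_removal(cepstra):
    """Subtract the per-sentence mean from each cepstral dimension.

    cepstra: array of shape (n_frames, n_coeffs), one row per 10 ms frame.
    Removing the sentence-level mean cancels stationary convolutional
    channel effects (e.g. a fixed microphone response), since convolution
    in the time domain becomes an additive offset in the cepstral domain.
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)


# Toy sentence: 200 frames of 48 cepstral coefficients (placeholder values
# with a deliberate nonzero offset standing in for a channel effect).
rng = np.random.default_rng(0)
features = rng.normal(loc=2.0, scale=1.0, size=(200, 48))

normalized = cepstral_mean_removal(features)
# After removal, each coefficient averages to (numerically) zero
# over the sentence.
print(np.abs(normalized.mean(axis=0)).max())
```

Because the mean is computed per sentence, this normalization needs no knowledge of the microphone, which matters for the unknown-microphone conditions of the Hub 3 test.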