The LIMSI Continuous Speech Dictation System: Evaluation on the ARPA Wall Street Journal Task

J.L. Gauvain, L.F. Lamel, G. Adda, M. Adda-Decker
LIMSI-CNRS, BP 133
91403 Orsay cedex, FRANCE
gauvain,lamel,adda,madda@limsi.fr

ABSTRACT

In this paper we report progress made at LIMSI in speaker-independent large vocabulary speech dictation using the ARPA Wall Street Journal-based CSR corpus. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on the newspaper texts for language modeling. The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with vocabularies of up to 20K words when used with bigram back-off language models. A second forward pass, which makes use of a word graph generated with the bigram, incorporates a trigram language model. Acoustic modeling uses cepstrum-based features, context-dependent phone models (intra- and inter-word), phone duration models, and sex-dependent models. The recognizer has been evaluated in the Nov92 and Nov93 ARPA tests for vocabularies of up to 20,000 words.

INTRODUCTION

Our speech recognition research focuses on developing recognizers that are task-, speaker- and vocabulary-independent so as to be easily adapted to a variety of applications. In this paper we report on our efforts in large vocabulary, speaker-independent continuous speech recognition using the ARPA Wall Street Journal-based CSR corpus [11]. The WSJ corpus contains large amounts of read speech material from a large number of speakers and has associated text material which can be used as a source for statistical language modeling. The recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and n-gram statistics estimated on text material for language modeling. Acoustic modeling uses cepstrum-based features, context-dependent phone models, duration models, and sex-dependent models.
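To make the back-off idea concrete: a bigram back-off LM gives observed word pairs a discounted relative-frequency estimate and falls back on the unigram distribution for unseen pairs. The sketch below uses absolute discounting as a stand-in; the paper does not specify its discounting scheme, and all names and the example text are illustrative, not from the WSJ corpus.

```python
from collections import Counter

def train_backoff_bigram(tokens, discount=0.5):
    """Train a toy back-off bigram LM with absolute discounting
    (a simplified stand-in for the back-off LMs used in the paper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)

    def prob(w1, w2):
        c1 = unigrams[w1]            # assumes w1 was seen in training
        c12 = bigrams[(w1, w2)]
        p_uni = unigrams[w2] / total  # unigram fallback distribution
        if c12 > 0:
            return (c12 - discount) / c1       # discounted bigram estimate
        # back-off weight: probability mass freed by discounting
        seen = sum(1 for (a, _) in bigrams if a == w1)
        alpha = discount * seen / c1
        # note: a full implementation renormalizes over unseen successors only
        return alpha * p_uni

    return prob

tokens = "the cat sat on the mat the cat ran".split()
prob = train_backoff_bigram(tokens)
print(prob("the", "cat"))   # seen pair: discounted count ratio
print(prob("cat", "the"))   # unseen pair: backed-off unigram mass
```

During the time-synchronous search, such conditional probabilities are combined with the acoustic scores at each word boundary.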
Statistical n-gram language models are estimated on a large corpus of newspaper text from the WSJ. The recognizer has also been evaluated on comparable tasks for the BREF corpus, and results were reported at EUROSPEECH-93 [3]. In the following sections we describe the recognizer and present an evaluation of the current system on the last two sets of evaluation test material: Nov92 [12] and Nov93.

RECOGNIZER OVERVIEW

The recognizer uses a time-synchronous graph-search strategy which is shown to still be viable with vocabularies of up to 20K words when used with bigram back-off language models (LMs). This one-level implementation includes intra- and inter-word context-dependent (CD) phone models, intra- and inter-word phonological rules, phone duration models, and gender-dependent models [6]. The HMM-based word recognizer graph is built by assembling word models according to the grammar into one large HMM. Each word model is obtained by concatenating phone models according to the word's phone transcription in the lexicon.

The recognizer makes use of continuous density HMMs (CDHMMs) with Gaussian mixtures for acoustic modeling. The main advantage continuous density modeling offers over discrete or semi-continuous (tied-mixture) observation densities is that the number of parameters used to model an HMM observation distribution can easily be adapted to the amount of training data available for that state. In the experimental section we demonstrate the improvement in performance obtained on the same test data simply by using additional training material. As a consequence, high-precision modeling can be achieved for frequently visited states without the explicit need for smoothing techniques on the densities of rarely visited states. Discrete and semi-continuous modeling use a fixed number of parameters to represent a given observation density and therefore cannot achieve high precision without the use of smoothing techniques.
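The per-state observation likelihood just described is a weighted sum of Gaussians. The sketch below evaluates such a mixture density in the log domain; the diagonal covariances, the two-component mixture, and all numeric values are illustrative assumptions, not the paper's actual configuration.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_mixture_density(x, weights, means, vars_):
    """log of sum_k w_k * N(x; mu_k, Sigma_k): an HMM state's
    observation log-likelihood under a Gaussian mixture."""
    log_terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, vars_)
    ]
    # log-sum-exp for numerical stability
    mx = max(log_terms)
    return mx + math.log(sum(math.exp(t - mx) for t in log_terms))

# Toy two-component mixture over 2-D features
weights = [0.6, 0.4]
means = [[0.0, 0.0], [2.0, 2.0]]
vars_ = [[1.0, 1.0], [0.5, 0.5]]
print(log_mixture_density([0.5, 0.2], weights, means, vars_))
```

More mixture components can be assigned to states with abundant training data and fewer to sparse ones, which is exactly the flexibility the smoothing discussion above is concerned with.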
The smoothing problem can be alleviated by tying some states of the Markov models so that more training data are available to estimate each state distribution. However, since this kind of tying requires careful design and some a priori assumptions, these techniques are primarily of interest when the training data is limited and cannot easily be increased.

Front end: A 48-component feature vector is computed every 10 ms. This feature vector consists of 16 Bark-frequency scale cepstrum coefficients computed on the 8 kHz bandwidth, together with their first and second order derivatives. For each frame (30 ms window), a 15-channel Bark power spectrum is obtained by applying triangular win-

ICASSP-94
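The 48-component vector above is the 16 static cepstra plus their first and second derivatives. A minimal sketch of appending such derivatives, using a standard regression-based delta scheme (the exact delta window used in the paper is an assumption):

```python
import numpy as np

def add_derivatives(cepstra, delta_win=2):
    """Append first- and second-order time derivatives to a (T, 16)
    sequence of cepstral frames, giving 3 * 16 = 48 components per frame."""
    def deltas(feat):
        T = len(feat)
        denom = 2 * sum(d * d for d in range(1, delta_win + 1))
        out = np.zeros_like(feat)
        for t in range(T):
            for d in range(1, delta_win + 1):
                prev = feat[max(t - d, 0)]   # replicate edge frames
                nxt = feat[min(t + d, T - 1)]
                out[t] += d * (nxt - prev)
        return out / denom

    d1 = deltas(cepstra)        # first-order derivatives
    d2 = deltas(d1)             # second-order derivatives
    return np.concatenate([cepstra, d1, d2], axis=1)

frames = np.random.randn(100, 16)      # 100 frames of 16 cepstral coefficients
print(add_derivatives(frames).shape)   # (100, 48)
```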