Eurospeech 2001 - Scandinavia

Comparison of MFCC and PLP Parameterizations in the Speaker Independent Continuous Speech Recognition Task

Josef Psutka, Luděk Müller and Josef V. Psutka

University of West Bohemia, Department of Cybernetics, Univerzitní 8, 30614 Pilsen, Czech Republic
psutka@kky.zcu.cz, muller@kky.zcu.cz, psutka_j@kky.zcu.cz

Abstract

The authors of this paper wish to contribute to the discussion about an optimal parameterization of speech signals in speech recognition systems. Our experiments deal with a telephone-based speaker-independent continuous speech recognition task in which the MFCC and PLP parameterizations were tested and compared. The benefit of adjusting the filters used in the MFCC and PLP parameterizations to the critical bandwidth of hearing [1] was explored, and the impact of the number of filters and of the number of computed parameters on recognition accuracy was tested. The results of these experiments showed that the MFCC parameterization is less sensitive to satisfying the theory of the critical bandwidth of hearing than the PLP parameterization. Experiments also proved that 5 PLP-cepstral coefficients (plus the derived 5 delta and 5 delta-delta coefficients) do not afford the best results, as might be deduced from recent work [2], [3]. However, after optimal settings, both parameterization techniques provided almost comparable results.

Introduction

One of the most important components of a speech recognition system is the front-end. In recent papers we can find various recommended parameterization techniques, as well as various modifications of the standard ones. Only very few works attempt to assess and compare these techniques against each other, and experiments with speaker-independent continuous speech recognition are quite rare. On the other hand, we can observe "over-dimensioned" or, more sporadically, "under-dimensioned" front-ends that respect neither the application task nor real working conditions such as time and memory demands.
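For reference, the frequency warpings that MFCC and PLP front-ends are conventionally built on, together with Zwicker's approximation of the critical bandwidth of hearing, can be sketched as follows. These are the standard textbook formulas, not necessarily the exact filter design used in the experiments reported here:

```python
import math

def hz_to_mel(f_hz):
    # Standard mel-scale warping used by MFCC front-ends.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    # Zwicker's critical-band rate approximation used by PLP front-ends.
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def critical_bandwidth(f_hz):
    # Approximate critical bandwidth of hearing (in Hz) at centre
    # frequency f_hz; bandwidth grows roughly linearly above ~500 Hz.
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69
```

Placing filter centres uniformly on the mel or Bark axis, with widths tracking `critical_bandwidth`, is what "adjusting the filters to the critical bandwidth of hearing" amounts to in practice.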
This paper deals with the MFCC and PLP parameterization techniques. It is well known that both techniques try to accommodate the parameter estimation process to the way humans hear and perceive sounds of various frequencies. From this point of view, the concepts of critical-band rate and critical bandwidth are frequently applied in speech recognition. While the problem of the critical-band rate, especially for the PLP parameterization technique, was discussed in [4], the following part of this article deals with the critical bandwidth and its benefit for both the MFCC and PLP parameterizations. In numerous tests, these techniques were compared for different numbers of filters distributed in the given frequency band and for different numbers of computed parameters. We also investigated the influence of introducing delta and delta-delta parameters, and the impact of mean and amplitude normalization of particular coefficients, on recognition accuracy. All experiments were performed on a continuous speech database pronounced by 100 speakers over a telephone channel. Because the speakers called from various places in the Czech Republic, the transfer conditions (e.g. noise, distortion, etc.) were generally slightly different for each call. Only a zero-gram language model was used during the recognition experiments, to gain a better view of the behavior of the word error rate (WER) caused by adjustments of the front-end.

Speech recognition conditions

The recognition experiments were performed with the recognition engine that is part of a telephone dialogue system [5] built at the Department of Cybernetics, University of West Bohemia, Pilsen. The recognition engine is based on a statistical approach. It incorporates a front-end, an acoustic model, a language model and a decoding block. The basic speech unit of our system is the triphone.
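The delta and delta-delta parameters mentioned above are conventionally obtained by linear regression over a short window of neighbouring frames (the HTK-style formula); delta-delta coefficients come from applying the same regression to the deltas. A minimal sketch, assuming list-of-lists feature vectors and frames clamped at the utterance edges:

```python
def delta(feats, K=2):
    """First-order regression (delta) coefficients for a sequence of
    per-frame feature vectors; K is the regression half-window."""
    T = len(feats)
    denom = 2.0 * sum(k * k for k in range(1, K + 1))
    deltas = []
    for t in range(T):
        acc = [0.0] * len(feats[0])
        for k in range(1, K + 1):
            fwd = feats[min(t + k, T - 1)]  # clamp at the last frame
            bwd = feats[max(t - k, 0)]      # clamp at the first frame
            for i in range(len(acc)):
                acc[i] += k * (fwd[i] - bwd[i])
        deltas.append([a / denom for a in acc])
    return deltas
```

For 5 static cepstral coefficients this yields the 5 delta and, applied twice, the 5 delta-delta coefficients of the 15-dimensional vectors discussed in the abstract.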
Each individual triphone is represented by a three-state HMM with a continuous output probability density function assigned to each state. At present we use 8 mixtures of multivariate Gaussians for each state. As the number of Czech triphones is too large, phonetic decision trees were used to tie the states of Czech triphones. The digitization of the analogue telephone signal was performed by a DIALOGIC D/21D telephone interface board at an 8 kHz sample rate, and the signal was converted to the 8-bit mu-law format. The aim of the front-end processor is to convert continuous speech into a sequence of feature vectors. This parameterization process can produce either Mel-Frequency Cepstral Coefficients (MFCCs) or PLP coefficients. The decoder uses a cross-word context-dependent HMM state network, which is generated by a Net generator. The input of the Net generator is a text grammar format represented by an extended BNF that respects the VoiceXML description. The whole net consists of one or more connected grammars. The decoder uses a Viterbi search technique with efficient beam pruning. Because a variety of noise sounds (e.g. loud breath, noise of the telephone channel) can appear in an utterance, a set of noise HMM models was introduced and trained in order to capture these sounds. The speech material for all experiments was taken from the Czech telephone corpus collected at the Department of Cybernetics. The corpus consists of read speech transmitted over a telephone channel. One hundred speakers were asked to read 40 sentences. These sentences were selected from Czech newspapers so as to contain the most frequently occurring triphones of spoken Czech. The corpus obtained was manually annotated and phonetically transcribed. Then it was randomly divided so that 100 sentences created the test part and the remainder of the corpus formed the training part. The