STREAMLINING THE FRONT END OF A SPEECH RECOGNIZER

Hua Yu and Alex Waibel
Interactive Systems Lab, Carnegie Mellon University, Pittsburgh, PA 15213
Email: hyu@cs.cmu.edu

ABSTRACT

In this paper we seek to streamline various operations within the front end of a speech recognizer, both to reduce unnecessary computation and to simplify the conceptual framework. First, a novel view of the front end in terms of linear transformations is presented. Then we study the invariance property of recognition performance with respect to linear transformations (LT) at the front end. Analysis reveals that several LT steps can be consolidated into a single LT, which effectively eliminates the Discrete Cosine Transform (DCT) step, part of the traditional MFCC (Mel-Frequency Cepstral Coefficient) front end. Moreover, a highly simplified, data-driven front-end scheme is proposed as a direct generalization of this idea. The new setup has no Mel-scale filtering, another part of the MFCC front end. Experimental results show a 5% relative improvement on the Broadcast News task.

1. LINEAR TRANSFORMATIONS IN THE TRADITIONAL FRONT END

The front end is a relatively independent component of a speech recognition system. Although the actual acoustic model parameters depend directly upon front-end parameterization, researchers tend to view it as a black box. When testing several different front ends, the acoustic model structure is seldom altered: it is simply a matter of plugging in another front end, re-estimating model parameters, and finally choosing the one that yields the lowest WER (Word Error Rate).

It is important to realize, however, that front-end design and acoustic modeling are closely coupled. Below we will go through a typical front end commonly seen in most LVCSR systems, with an emphasis on connections between the two components:

1. First, the Fourier spectrum is warped to compensate for gender/speaker differences (Vocal Tract Length Normalization, or VTLN).

2.
The warped spectrum is then smoothed by integrating over triangular bins arranged along a non-linear scale. The Mel scale, the most commonly used one, is designed to approximate the frequency resolution of the human ear, which is more sensitive at lower frequencies. Normally 30 triangular-shaped filters are used in JRTk (Janus Recognition Toolkit).

3. The log of the filter-bank output is taken to compress the dynamic range of the spectrum, so that the statistics of the estimated power spectrum are approximately Gaussian.

4. Next, cepstral coefficients are obtained by applying a Discrete Cosine Transform (DCT) to the log filter-bank outputs. The goal is mostly to achieve a decorrelation effect so that the subsequent modeling using diagonal covariance matrices is more valid. Typically, the first 13 coefficients are retained.

5. Cepstral Mean Normalization (CMN) is commonly used to normalize for the channel effect, so we can build a “channel-blind” acoustic model later.

6. Delta and double-delta features are appended to the MFCC vector to capture speech dynamics.

7. Finally, LDA (Linear Discriminant Analysis) can be used for dimensionality reduction. On top of LDA, there can be a further diagonalization transform so that the feature vector fits better with the diagonal covariance assumption in the acoustic model [4, 3, 6]. This is also called Maximum Likelihood Linear Transform (MLLT), which happens to be a special case of semi-tied covariance matrices [2].

[Figure 1: A Typical MFCC Front End. The pipeline runs from FFT coefficients through the Mel-scale filterbank, log compression, DCT, CMN, and LDA; VTLN and the delta/double-delta steps are not shown for simplicity.]
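Steps 2-4 (filter-bank smoothing, log compression, DCT decorrelation) can be sketched in a few lines of numpy. The FFT size, sampling rate, and the particular Mel-scale formula below are illustrative assumptions, not the exact settings used in JRTk:

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters=30, n_fft=512, sr=16000):
    """Step 2: triangular filters spaced evenly on the Mel scale (illustrative)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    """Steps 2-4 applied to one windowed frame of samples."""
    power = np.abs(np.fft.rfft(frame, n=512)) ** 2   # power spectrum
    logmel = np.log(fb @ power + 1e-10)              # step 3: log compression
    return dct(logmel, type=2, norm='ortho')[:n_ceps]  # step 4: keep first 13
```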
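Steps 5 and 6 reduce to simple array operations over the cepstral matrix (frames x coefficients). The +/-1-frame difference used for the deltas below is one common choice, assumed for illustration rather than taken from the paper:

```python
import numpy as np

def cmn(cepstra):
    """Step 5: subtract the per-utterance cepstral mean (channel normalization)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def add_deltas(feats):
    """Step 6: append delta and double-delta features (simple +/-1 frame slope)."""
    padded = np.pad(feats, ((1, 1), (0, 0)), mode='edge')
    delta = (padded[2:] - padded[:-2]) / 2.0
    padded2 = np.pad(delta, ((1, 1), (0, 0)), mode='edge')
    ddelta = (padded2[2:] - padded2[:-2]) / 2.0
    return np.hstack([feats, delta, ddelta])
```

Applied to a 13-dimensional cepstral stream, this yields the familiar 39-dimensional vector per frame.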
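The LDA projection of step 7 can be estimated from labeled feature vectors by maximizing the ratio of between-class to within-class scatter, i.e. solving an eigenproblem on Sw^-1 Sb. This is a generic textbook LDA sketch (the MLLT diagonalization step is omitted), not the paper's exact procedure:

```python
import numpy as np

def lda(feats, labels, n_dims):
    """Step 7: LDA projection matrix via the scatter-ratio eigenproblem."""
    d = feats.shape[1]
    mean = feats.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        xc = feats[labels == c]
        mc = xc.mean(axis=0)
        Sw += (xc - mc).T @ (xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(xc) * (diff @ diff.T)
    # eigenvectors of Sw^-1 Sb with the largest eigenvalues span the projection
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_dims]]
```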