TRAPPING CONVERSATIONAL SPEECH: EXTENDING TRAP/TANDEM APPROACHES TO CONVERSATIONAL TELEPHONE SPEECH RECOGNITION

Nelson Morgan, Barry Y. Chen, Qifeng Zhu, Andreas Stolcke
International Computer Science Institute, Berkeley, CA, USA
University of California Berkeley, Berkeley, CA, USA
SRI International, Menlo Park, CA, USA
{morgan, byc, qifeng, stolcke}@icsi.berkeley.edu

ABSTRACT

TempoRAl Patterns (TRAPs) and Tandem MLP/HMM approaches incorporate feature streams computed from longer time intervals than the conventional short-time analysis. These methods have been used for challenging small- and medium-vocabulary recognition tasks, such as Aurora and SPINE. Conversational telephone speech recognition is a difficult large-vocabulary task, with current systems giving incorrect output for 20-40% of the words, depending on the system complexity and test set. Training and test times for this problem also tend to be relatively long, making rapid development quite difficult. In this paper we report experiments with a reduced conversational speech task that led to the adoption of a number of engineering decisions for the design of an acoustic front end. We then describe our results with this front end on a full-vocabulary conversational telephone speech task. In both cases the front end yielded significant improvements over the baseline.

1. AUGMENTING CONVENTIONAL FEATURES

For decades, the feature extraction component of speech recognition engines has consisted of some form of local spectral envelope estimation, typically with some simple transformation; current front ends are based largely on the Mel cepstrum or perceptual linear prediction (PLP) [1] computed from an analysis window of roughly 25 or 30 ms surrounding a central signal point, stepped along every 10 ms. A number of alternatives have been developed in recent years.
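The conventional short-time setup described above can be made concrete with a small sketch. The helper below is illustrative only (the function name, padding-free slicing, and the 8 kHz rate, typical for telephone speech, are our assumptions, not something the paper specifies):

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, win_ms=25, hop_ms=10):
    """Slice a waveform into overlapping short-time analysis windows:
    a ~25 ms window stepped along every 10 ms, as in conventional
    Mel cepstrum or PLP front ends."""
    win = int(sample_rate * win_ms / 1000)   # 200 samples at 8 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - win) // hop
    return np.stack([signal[i * hop : i * hop + win]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(8000))  # 1 second of (silent) audio
```

Each row of `frames` would then feed one spectral envelope estimate, yielding a feature vector every 10 ms.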
One such approach, tandem acoustic modeling [2, 3, 4], uses a multi-layer perceptron (MLP) to first discriminatively transform multiple feature vectors (typically PLP from 9 frames) before using them as observations for Gaussian mixture hidden Markov models (GMHMMs). Thus, the neural network, which could be called a "feature net", incorporates around 100 ms of speech. In this paper we will refer to the resulting variables as PLP/MLP features. Others have also tried incorporating longer temporal information, yielding significant improvements in speech recognition performance (e.g., [5]).

The MLP is typically trained using phonetic targets. This approach works very well in matched training and test conditions, often achieving lower word error rates than systems without the discriminant nonlinear transformation provided by the MLP. However, in the case of mismatched training and testing conditions, ICSI and OGI researchers working on the Aurora task found it preferable to augment the original features with the feature net outputs, essentially using the concatenation of the original features and the PLP/MLP features as the front end for the GMHMM [6]. A similar approach was used in [7], where standard features were augmented by a complementary source of information (in this case, estimates of formants from a mixture of Gaussians).

Another promising approach has been to combine the PLP/MLP features with features derived from the outputs of MLPs incorporating long-time log critical band energy trajectories (500 ms - 1 s) [8, 9]. The set of these MLPs forms the TRAPS system, named as such because the system learns discriminative TempoRAl Patterns (TRAPS) in speech. MLPs in the TRAPS system are also trained with phonetic targets. We have observed that systems using the combination of the two feature sets perform better than those using either feature type alone.
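The "feature net" input above, 9 consecutive frames of PLP features spanning roughly 100 ms, can be sketched as a simple context-stacking step. The edge-padding convention below (repeating the first/last frame) is a common choice, not one mandated by the cited papers:

```python
import numpy as np

def stack_context(features, context=9):
    """Stack `context` consecutive feature frames, centered on each
    time step, into one flat MLP input vector. With 10 ms-spaced PLP
    frames and context=9, each input spans about 100 ms of speech."""
    half = context // 2
    # Pad edges by repeating the first and last frames (one common
    # convention; the tandem papers do not specify the padding).
    padded = np.concatenate([np.repeat(features[:1], half, axis=0),
                             features,
                             np.repeat(features[-1:], half, axis=0)])
    return np.stack([padded[t:t + context].ravel()
                     for t in range(len(features))])

plp = np.random.randn(100, 13)   # 100 frames of 13-dim PLP features
inputs = stack_context(plp)      # one 117-dim MLP input per frame
```

The MLP trained on such inputs (with phonetic targets) then emits one posterior vector per frame, which becomes the PLP/MLP feature stream.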
The approaches listed above were developed on small tasks, i.e., connected digits, continuous numbers, and TIMIT phone recognition, where the training and testing sets were small in both vocabulary and data size. We have now tested systems that incorporate these features in two progressively larger tasks. We used conventional front end features (12th order PLP plus energy and derivatives), augmented with the combination of PLP/MLP and TRAPS features. These corresponded to three different temporal spans. The original PLP features were derived from short term spectral analysis (25 ms time slices every 10 ms). In contrast, PLP/MLP used 9 frames of PLP features (100 ms), and TRAPS used 51 frames of log critical band energies (500 ms). For the PLP/MLP stream, we trained discriminative feature net MLPs using 46 phoneme targets generated from forced alignments using the SRI DECIPHER recognizer. For the second stream, the first stage TRAPS MLPs took log critical band energy trajectories, formed by taking 51 consecutive frames of log critical band energies every 10 ms, and transformed by principal component analysis (PCA). These critical band MLPs were trained with the same phoneme targets as in the feature net MLP. A "merger" MLP (trained with these same phoneme targets) combined the output of the critical band MLPs to produce a single estimate of phoneme posteriors every 10 ms.

Since the outputs of both the TRAPS classifier and the PLP net can be interpreted as posterior probabilities of the 46 phonemes, we could combine them using frame-wise posterior probability combination techniques [10, 11] (described briefly below). After combination, we took the log of the posterior vector to make it more Gaussian, and then orthogonalized and reduced the dimensionality of the posterior vector using PCA. The resulting variables were then appended to the original PLP cepstra to form the augmented feature vector. Figure 1 summarizes this process.
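The combine-log-PCA-append pipeline above can be sketched in a few lines. This is a minimal illustration, assuming a simple frame-wise average as the combination rule (one of the techniques in the cited work; others, such as products of posteriors, are also used) and an illustrative output dimensionality, which the paper does not fix:

```python
import numpy as np

def combine_and_append(post_a, post_b, plp, out_dim=25):
    """Frame-wise combination of two phoneme posterior streams,
    followed by log, PCA orthogonalization/reduction, and
    concatenation with the original PLP cepstra."""
    combined = 0.5 * (post_a + post_b)      # frame-wise average
    logp = np.log(combined + 1e-10)         # log makes it more Gaussian
    centered = logp - logp.mean(axis=0)
    # PCA via SVD of the centered log-posteriors; keep out_dim axes
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    reduced = centered @ vt[:out_dim].T
    return np.concatenate([plp, reduced], axis=1)

rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(46), size=200)    # TRAPS posteriors (toy)
b = rng.dirichlet(np.ones(46), size=200)    # PLP/MLP posteriors (toy)
plp = rng.standard_normal((200, 39))        # PLP + energy + derivatives
feats = combine_and_append(a, b, plp)       # augmented feature vectors
```

In a real system the two posterior streams would come from the trained TRAPS merger MLP and the feature net, and the PCA projection would be estimated on training data rather than per utterance.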
In what follows, we refer to these augmented feature vectors