Toward Domain-Independent Conversational Speech Recognition

Brian Kingsbury, Lidia Mangu, George Saon, Geoffrey Zweig, Scott Axelrod, Vaibhava Goel, Karthik Visweswariah, and Michael Picheny

IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
{bedk,mangu,gsaon,gzweig,axelrod,vgoel,kv1,picheny}@us.ibm.com

Abstract

We describe a multi-domain, conversational test set developed for IBM's Superhuman speech recognition project and our 2002 benchmark system for this task. Through the use of multi-pass decoding, unsupervised adaptation and combination of hypotheses from systems using diverse feature sets and acoustic models, we achieve a word error rate of 32.0% on data drawn from voicemail messages, two-person conversations and multiple-person meetings.

1. Introduction

The goal of IBM's Superhuman speech recognition project [1, 2] is to develop a domain-independent speech recognition system that matches or exceeds human performance across the full range of possible application domains, acoustic conditions and speaker characteristics. To foster work toward this admittedly aggressive objective, we have defined a test set comprising conversational American English material drawn from a number of application domains and set a goal of achieving annual 25% relative improvements in word error rate on this test set.

In this paper, we describe the components of our "Superhuman" test corpus, then describe our 2002 benchmark system for the corpus. To deal with the broad range of material present in the test set, we employed a recognition strategy based on multiple passes of recognition interleaved with unsupervised acoustic model adaptation and on combination of recognition hypotheses from systems using disparate feature sets and acoustic models.
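The hypothesis-combination idea above can be illustrated with a minimal word-level voting scheme in the spirit of ROVER. This is a sketch only: the paper does not specify its combination algorithm here, the alignment of hypotheses is assumed to have been done already, and all names in the code are illustrative.

```python
# Sketch: majority voting over pre-aligned word sequences from several
# recognizers. None marks a deletion at an aligned position. This is an
# illustration of hypothesis combination, not the paper's actual method.
from collections import Counter

def vote(aligned_hyps):
    """Pick the most frequent token at each aligned position."""
    combined = []
    for slot in zip(*aligned_hyps):
        token, _count = Counter(slot).most_common(1)[0]
        if token is not None:          # a winning deletion emits no word
            combined.append(token)
    return combined

hyps = [
    ["the", "cat", "sat", None],
    ["the", "bat", "sat", "down"],
    ["the", "cat", "sat", "down"],
]
print(vote(hyps))  # ['the', 'cat', 'sat', 'down']
```

The benefit of such combination grows with the diversity of the component systems, which is why the benchmark system combines recognizers built on disparate feature sets and acoustic models.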
We describe the basic techniques used in the benchmark system for signal processing, acoustic modeling, adaptation, and language modeling, then we describe the architecture of the recognition system and present the performance of the system on the Superhuman test corpus at various stages of processing.

2. A multi-domain conversational test set

We had a number of goals in mind when we designed the test set. First, the test set had to cover a reasonably broad range of conversational applications and contain data representing key challenges to reliable recognition, including various forms of acoustic interference, speech from non-native speakers, and a large recognition vocabulary. Second, the test set had to include at least one component that is readily available to other researchers, to facilitate comparisons between our recognizers and those developed externally. Third, the test set needed to be reasonably small, to facilitate rapid turnaround of experiments on it. For all experiments reported here, we used a test set composed of the following five parts.

swb98 The Switchboard portion of the 1998 Hub 5e evaluation set, comprising 113 minutes of audio. The data are collected from two-person conversations between strangers on a preassigned topic. A variety of telephone channels and regional dialects are represented in the data.

mtg An initial release of the Bmr007 meeting from the ICSI Meeting corpus [3, 4], comprising 95 minutes of audio. The data are collected from eight speakers wearing either lapel microphones or close-talking headsets: five native speakers of American English (two females and three males) and three non-native speakers (all males). The primary challenge in this test set is the presence of background speech in many of the utterances. The crosstalk problem is especially severe for speakers recorded using lapel microphones.

cc1 30 minutes of audio from a call center.
The data are collected from customer service representatives (CSRs) and customers calling with service requests or problems. The primary challenge in this test is acoustic interference: a combination of nonlinear distortion from speech compression, background noise from both the CSR and customer sides, and intermittent beeps on the channel, which are played to remind the customer that the call is being recorded.

cc2 34 minutes of audio from a second call center. The recordings are from a different center than the cc1 test set, but cover similar subject matter and have similarly poor acoustics. This data set has no information associating speakers with sets of utterances, which poses problems for speaker and channel adaptation.

vm Test data from the IBM Voicemail corpus, comprising 52 minutes of audio. This material was previously reported on as the E-VM1 test set [5], and is a superset of the test data in the Voicemail Corpus Part I and Part II distributed by the LDC. Unlike the other tests, the voicemail data are conversational monologues. The acoustic quality of the data is generally quite high, although loud clicks caused by the speaker hanging up at the end of some messages can pose problems for feature normalization, especially normalization of c0 based on the maximum value of c0 within an utterance. This test set also has no information associating speakers with sets of utterances.

3. Signal processing

The systems in this work use either Mel-frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features as raw features. The MFCC features are based on a bank of 24 Mel filters spanning 0–4.0 kHz. The PLP features are based on a bank of 18 Mel filters spanning 0.125–3.8 kHz and use a 12th-order autoregressive analysis to model the auditory spectrum. Both feature sets are based on an initial spectral analysis that uses 25 ms frames smoothed with a Hamming window, a 10 ms
frame step, and adds the equivalent of 1 bit of noise to the power spectra as a form of flooring. Both feature sets are also computed using periodogram averaging to reduce the variance of the spectral estimates.

EUROSPEECH 2003 - GENEVA
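The framing and flooring steps described above can be sketched as follows. This is a minimal illustration, assuming 8 kHz telephone-band audio and an additive constant as the noise floor; the periodogram-averaging, filter-bank, and cepstral stages are omitted, and all function and parameter names are ours, not the paper's.

```python
# Sketch of the initial spectral analysis: 25-ms Hamming-windowed frames
# with a 10-ms step, and a constant added to each power spectrum as a
# floor (standing in for the paper's "1 bit of noise"). Illustrative only.
import numpy as np

def framed_power_spectra(signal, sr=8000, frame_ms=25, step_ms=10, floor=1.0):
    frame_len = int(sr * frame_ms / 1000)   # 200 samples at 8 kHz
    step = int(sr * step_ms / 1000)         # 80 samples at 8 kHz
    window = np.hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, step):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        spectra.append(power + floor)       # flooring keeps log-spectra bounded
    return np.array(spectra)

sig = np.random.randn(8000)                 # one second of noise at 8 kHz
S = framed_power_spectra(sig)
print(S.shape)                              # (98, 101): 98 frames, 101 rfft bins
```

The floor bounds the dynamic range of the subsequent log-spectral and cepstral computations, which makes the features less sensitive to very low-energy frames.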