FLEXIBLE MULTI-STREAM FRAMEWORK FOR SPEECH RECOGNITION USING MULTI-TAPE FINITE-STATE TRANSDUCERS

I. Lee Hetherington, Han Shu, and James R. Glass

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
{ilh, hshu, jrg}@csail.mit.edu

ABSTRACT

We present an approach to general multi-stream recognition utilizing multi-tape finite-state transducers (FSTs). The approach is novel in that each of the multiple "streams" of features can represent either a sequence (e.g., fixed- or variable-rate frames) or a directed acyclic graph (e.g., containing hypothesized phonetic segmentations). Each transition of the multi-tape FST specifies the models to be applied to each stream and the degree of feature-stream asynchrony to allow. We show how this framework can easily represent the 2-stream variable-rate landmark and segment modeling utilized by our baseline SUMMIT speech recognizer. We present experiments merging standard hidden Markov models (HMMs) with landmark models on the Wall Street Journal speech recognition task, and find that some degree of asynchrony can be critical when combining different types of models. We also present experiments performing audio-visual speech recognition on the AV-TIMIT task.

1. INTRODUCTION

Most commonly, speech recognition systems utilize a single stream of features, often a fixed-rate sequence of observation vectors (e.g., MFCCs and their derivatives) modeled using hidden Markov models (HMMs). Extensions to this traditional HMM approach include segmental (e.g., whole-phone) modeling [1, 2], multi-stream sub-band modeling [3], multi-stream multi-rate modeling [4, 5], articulatory-inspired modeling [6, 7], multi-modal recognition [8], and audio-visual speech recognition [9, 10, 11], among others.
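To make the abstract's central idea concrete, the following is a minimal sketch of what a multi-tape FST transition might look like when each arc names the model to apply to each feature stream and the asynchrony it tolerates. All class, field, and model names here are illustrative assumptions for exposition; the paper does not specify the SUMMIT recognizer's actual data structures.

```python
from dataclasses import dataclass

# Hypothetical sketch of a multi-tape FST arc, following the paper's
# description: each transition specifies the model to be applied to
# each feature stream and the degree of inter-stream asynchrony to
# allow. Names are illustrative, not the recognizer's real API.

@dataclass
class MultiTapeArc:
    src: int            # source state
    dst: int            # destination state
    models: dict        # stream name -> model label (None = no model on that tape)
    max_async: float    # maximum allowed time skew between streams (seconds)
    weight: float = 0.0 # arc cost, e.g., a negative log probability

# A toy 2-stream arc: apply a landmark model and a segment model,
# allowing up to 30 ms of asynchrony between the two streams here.
arc = MultiTapeArc(src=0, dst=1,
                   models={"landmark": "t-closure", "segment": "t"},
                   max_async=0.030)
```

Under this view, a conventional single-stream HMM recognizer is the special case with one tape and zero allowed asynchrony.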
Many of these multi-stream approaches vary in how often information from the different streams is integrated (e.g., every state, every phone or syllable boundary, or at the end of the utterance) and whether the initial search utilizes all streams or additional streams are instead integrated in a multi-pass approach.

Our SUMMIT speech recognition system [2] has long integrated two feature streams, landmarks and segments, combining them at phone boundaries in an integrated search. At such phone boundaries, determined automatically by the search, the landmark and segment feature streams are fully synchronized in time. When performing some initial experiments combining a traditional HMM with our landmark and segment models, we found that synchronization with the HMM was an issue. Our landmark/segment system preferred different phone boundaries than a context-dependent HMM, and we thus desired a framework to explore asynchrony in addition to multiple feature streams. Others have found that context-dependent HMMs prefer phonetic alignments that may not match transcriptions or other models well, including context-independent HMMs [12], and thus allowing some degree of asynchrony between HMMs and other models may be critical to successful integration. In this paper we present our multi-stream framework, which utilizes a multi-tape finite-state transducer (FST) to express how multiple feature streams are combined and the allowable asynchrony between them at different parts of the search.

*Support for this research was provided in part by the National Science Foundation under grant #IIS-0415865.

Related work includes multi-stream recognition by HMM recombination by Bourlard, Dupont, et al. [3, 9], in which HMMs representing different streams are allowed to evolve independently until encountering special synchronization states.
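The bounded-asynchrony idea discussed above can be illustrated with a toy scoring rule: two streams each propose a phone-boundary time and a log score, and their hypotheses recombine only if the boundary times fall within a tolerance window. This is a simplified illustration under assumed conditional independence of the streams, not the paper's actual search procedure; the function name and the 40 ms default are invented for the example.

```python
# Toy illustration of bounded-asynchrony stream combination (not the
# paper's algorithm): hypotheses from two streams merge only if their
# proposed phone-boundary times differ by at most `tol` seconds.

def combine_streams(t_hmm, logp_hmm, t_landmark, logp_landmark, tol=0.040):
    """Return a combined log score, or None if the streams are too skewed."""
    if abs(t_hmm - t_landmark) > tol:
        return None  # asynchrony exceeds the allowed window: prune
    # Streams treated as conditionally independent: log scores add.
    return logp_hmm + logp_landmark

# Boundaries at 1.00 s and 1.02 s are within a 40 ms window, so the
# hypotheses recombine; at 1.00 s vs. 1.10 s they would be pruned.
score = combine_streams(1.00, -12.3, 1.02, -9.8)
```

Setting `tol=0` recovers the fully synchronous boundary integration of the baseline landmark/segment system, while larger values admit the kind of slack the HMM experiments in Section 3 explore.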
The multi-rate HMM framework of Çetin and Ostendorf [5] utilizes graphical models and allows different streams to operate at different rates. The multi-modal approach of Johnston, Bangalore, et al. [8] jointly recognizes gestures and speech using multi-tape FSTs, with integration of the modalities occurring at the end of the utterance (either in two passes or by searching through recognition lattices computed on each modality).

In Section 2 we start with background on our pre-existing 2-stream system and present our new multi-stream framework. In Section 3 we report on experiments run with the new framework, including integration with traditional HMM models and audio-visual speech recognition.

2. MULTI-STREAM, MULTI-TAPE FST FRAMEWORK

In this section we begin with a description of the 2-stream modeling of landmarks and segments utilized by our baseline speech recognizer, and then generalize this to arbitrary feature streams allowing asynchrony using a multi-tape FST representation.

2.1. Landmark & Segment Modeling: 2 Streams

Our baseline speech recognizer [2] has long made use of both landmark and segmental acoustic features. Landmarks are proposed with the goal of having them occur at phone boundaries. Segments are proposed with the goal of having them span whole phones. In practice, both landmarks and segments are over-generated, allowing the recognition search to choose the optimal phonetic segmentation. For landmarks this means that some will be proposed internal to phones. For segments this means that a directed acyclic graph is proposed to cover all hypothesized segmentations of the utterance into phones.

The landmark models and the segment models operate on separate feature streams that are both derived from the same set of fixed-rate (5 ms) MFCC features. The landmark feature stream is

142440469X/06/$20.00 ©2006 IEEE ICASSP 2006