FLEXIBLE MULTI-STREAM FRAMEWORK FOR SPEECH RECOGNITION
USING MULTI-TAPE FINITE-STATE TRANSDUCERS
I. Lee Hetherington, Han Shu, and James R. Glass
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Cambridge, MA 02139, USA
{ilh, hshu, jrg}@csail.mit.edu
ABSTRACT
We present an approach to general multi-stream recognition uti-
lizing multi-tape finite-state transducers (FSTs). The approach is
novel in that each of the multiple “streams” of features can rep-
resent either a sequence (e.g., fixed- or variable-rate frames) or a
directed acyclic graph (e.g., containing hypothesized phonetic seg-
mentations). Each transition of the multi-tape FST specifies the
models to be applied to each stream and the degree of feature stream
asynchrony to allow. We show how this framework can easily rep-
resent the 2-stream variable-rate landmark and segment modeling
utilized by our baseline SUMMIT speech recognizer. We present
experiments merging standard hidden Markov models (HMMs) with
landmark models on the Wall Street Journal speech recognition task,
and find that some degree of asynchrony can be critical when com-
bining different types of models. We also present experiments per-
forming audio-visual speech recognition on the AV-TIMIT task.
1. INTRODUCTION
Most commonly, speech recognition systems utilize a single stream
of features, often a fixed-rate sequence of observation vectors (e.g.,
MFCCs and their derivatives) modeled using hidden Markov models
(HMMs). Extensions to this traditional HMM approach include seg-
mental (e.g., whole-phone) modeling [1, 2], multi-stream sub-band
modeling [3], multi-stream multi-rate modeling [4, 5], articulatory-
inspired modeling [6, 7], multi-modal recognition [8], and audio-
visual speech recognition [9, 10, 11], among others. Many of these
multi-stream approaches vary in how often information from the dif-
ferent streams is integrated (e.g., every state, every phone or sylla-
ble boundary, or at the end of the utterance) and whether the initial
search utilizes all streams or whether additional streams are inte-
grated in a multi-pass approach.
Our SUMMIT speech recognition system [2] has long combined
two feature streams, landmarks and segments, merging them at
phone boundaries within a single search. At such phone bound-
aries, determined automatically by the search, the landmark and seg-
ment feature streams are fully synchronized in time. When perform-
ing some initial experiments combining a traditional HMM with our
landmark and segment models, we found that synchronization with
the HMM was an issue. Our landmark/segment system preferred
different phone boundaries than a context-dependent HMM did, and
we thus desired a framework to explore asynchrony in addition to
multiple feature streams. Others have found
that context-dependent HMMs prefer phonetic alignments that may
not match transcriptions or other models well, including context-
independent HMMs [12]; thus, allowing some degree of asynchrony
between HMMs and other models may be critical to successful in-
tegration. In this paper we present our multi-stream framework,
which utilizes a multi-tape finite-state transducer (FST) to express
how multiple feature streams are combined and the allowable asyn-
chrony between them at different parts of the search.

Support for this research was provided in part by the National
Science Foundation under grant #IIS-0415865.
Related work includes the multi-stream HMM recombination ap-
proach of Bourlard, Dupont, et al. [3, 9], in which HMMs
representing different streams are allowed to evolve independently
until encountering special synchronization states. The multi-rate
HMM framework of Çetin and Ostendorf [5] utilizes graphical mod-
els and allows different streams to operate at different rates. The
multi-modal approach of Johnston and Bangalore et al. [8] jointly
recognizes gestures and speech using multi-tape FSTs, with integra-
tion of the modalities occurring at the end of the utterance (either
two passes or search through recognition lattices computed on each
modality).
In Section 2 we start with background on our pre-existing 2-
stream system and present our new multi-stream framework. In Sec-
tion 3 we report on experiments run with the new framework, in-
cluding integration with traditional HMM models and audio-visual
speech recognition.
2. MULTI-STREAM, MULTI-TAPE FST FRAMEWORK
In this section we begin with a description of the 2-stream modeling
of landmarks and segments utilized by our baseline speech recog-
nizer and then generalize this to arbitrary feature streams allowing
asynchrony using a multi-tape FST representation.
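To make the representation concrete, the following is a minimal Python sketch of such a multi-tape FST, in which each transition carries one model label per feature stream plus an asynchrony bound. The class and field names, the example model labels, and the 30 ms bound are illustrative assumptions, not the paper's actual data structures:

```python
from dataclasses import dataclass, field

# Hypothetical sketch (not the authors' implementation): a transition of a
# multi-tape FST. Each of the K tapes names the model to apply to the
# corresponding feature stream, and max_asynchrony bounds how far the
# streams' time indices may drift apart across this transition.
@dataclass
class MultiTapeTransition:
    source: int           # source state
    target: int           # target state
    tape_labels: tuple    # one model label (or None for epsilon) per stream
    max_asynchrony: float # allowed inter-stream time drift, in seconds
    weight: float = 0.0   # transition cost (e.g., -log probability)

@dataclass
class MultiTapeFST:
    num_tapes: int
    start: int
    finals: set = field(default_factory=set)
    transitions: list = field(default_factory=list)

    def add_transition(self, t: MultiTapeTransition):
        # Every transition must label all tapes, one per feature stream.
        assert len(t.tape_labels) == self.num_tapes
        self.transitions.append(t)

# Example: a 2-tape transition applying a landmark model on stream 0 and an
# HMM state model on stream 1, tolerating up to 30 ms of asynchrony.
fst = MultiTapeFST(num_tapes=2, start=0, finals={1})
fst.add_transition(MultiTapeTransition(0, 1, ("landmark:t_cl", "hmm:t-1"), 0.030))
```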
2.1. Landmark & Segment Modeling: 2 Streams
Our baseline speech recognizer [2] has long made use of both land-
mark and segmental acoustic features. Landmarks are proposed with
the goal of having them occur at phone boundaries. Segments are
proposed with the goal of having them span whole phones. In prac-
tice, both landmarks and segments are over-generated, allowing the
recognition search to choose the optimal phonetic segmentation. For
landmarks this means that some will be proposed internal to phones.
For segments this means that a directed acyclic graph is proposed to
cover all hypothesized segmentations of the utterance into phones.
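As an illustration of such a graph, the sketch below encodes hypothesized segment boundaries as a small directed acyclic graph and enumerates every complete segmentation; the boundary times are invented for the example and do not come from the paper:

```python
# Minimal sketch of a segment graph: nodes are hypothesized landmark times
# (seconds), and each edge is a hypothesized segment spanning a whole phone.
# Multiple outgoing edges from a node encode competing segmentations; any
# path from the first to the last landmark is one complete segmentation.
# All times below are invented for illustration.
segment_graph = {
    0.00: [0.08, 0.15],  # two competing segment end points
    0.08: [0.15],        # a landmark internal to one hypothesis
    0.15: [0.27],
    0.27: [],            # end of utterance
}

def enumerate_segmentations(graph, start, end, prefix=None):
    """Yield every path (complete segmentation) through the DAG."""
    prefix = (prefix or []) + [start]
    if start == end:
        yield prefix
        return
    for nxt in graph[start]:
        yield from enumerate_segmentations(graph, nxt, end, prefix)

paths = list(enumerate_segmentations(segment_graph, 0.00, 0.27))
# Two complete segmentations: one passing through 0.08, one skipping it.
```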
The landmark models and the segment models operate on sep-
arate feature streams that are both derived from the same set of
fixed-rate (5ms) MFCC features. The landmark feature stream is
I-417 1-4244-0469-X/06/$20.00 ©2006 IEEE ICASSP 2006