786 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 4, APRIL 2013
A Bottom-Up Modular Search Approach to Large
Vocabulary Continuous Speech Recognition
Sabato Marco Siniscalchi, Member, IEEE, Torbjørn Svendsen, Senior Member, IEEE, and
Chin-Hui Lee, Fellow, IEEE
Abstract—A novel bottom-up decoding framework for large
vocabulary continuous speech recognition (LVCSR) with a mod-
ular search strategy is presented. Weighted finite state machines
(WFSMs) are utilized to accomplish stage-by-stage acoustic-to-
linguistic mappings from low-level speech attributes to high-level
linguistic units in a bottom-up manner. Probabilistic attribute
and phone lattices are used as intermediate vehicles to facilitate
knowledge integration at different levels of the speech knowl-
edge hierarchy. The final decoded sentence is obtained by per-
forming lexical access and applying syntactical constraints. Two
key factors are critical to attaining high recognition accuracy,
namely: (i) generation of high-precision sets of competing hy-
potheses at every intermediate stage; and (ii) low-error pruning
of unlikely theories to reduce input lattice sizes while maintaining
high-quality hypotheses for the next layers of knowledge integra-
tion. The decoupled nature of the proposed techniques allows
us to obtain recognition results at all stages, including attribute,
phone, and word levels, and enables an integration of various
knowledge sources that is not easily accomplished in the state-of-
the-art hidden Markov model (HMM) systems based on top-down
knowledge integration. Evaluation on the Nov92 test set of the
5000-word Wall Street Journal task demonstrates that high-accuracy attribute
and phone classification can be attained. As for word recog-
nition, the proposed WFSM-based framework achieves encour-
aging word error rates. Finally, by combining attribute scores
with the conventional HMM likelihood scores and re-ordering
the N-best lists obtained from the word lattices generated with
the proposed WFSM system, the word error rate (WER) can be
further reduced.
Index Terms—Artificial neural network, knowledge integration,
large vocabulary continuous speech recognition (LVCSR), speech
attribute detection, weighted finite state machines (WFSM).
I. INTRODUCTION
STATE-OF-THE-ART automatic speech recognition
(ASR) technology is based on a pattern matching
(ASR) technology is based on a pattern matching
framework that is motivated by expressing spoken utterances
as stochastic patterns [1]. Hidden Markov models (HMMs)
Manuscript received July 02, 2012; revised September 29, 2012; accepted
December 08, 2012. Date of publication December 20, 2012; date of current
version January 18, 2013. The associate editor coordinating the review of this
manuscript and approving it for publication was Prof. Haizhou Li.
S. M. Siniscalchi is with the Faculty of Architecture and Engineering, Univer-
sity of Enna “Kore,” 94100 Enna, Italy (e-mail: marco.siniscalchi@unikore.it).
T. Svendsen is with the Department of Electronics and Telecommunications,
Norwegian University of Science and Technology, 7491 Trondheim, Norway
(e-mail: torbjorn@iet.ntnu.no).
C.-H. Lee is with the School of Electrical and Computer Engineering, Georgia
Institute of Technology, Atlanta, GA 30332 USA (e-mail: chl@ece.gatech.edu).
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2012.2234115
(e.g., [2]) have then been used to characterize these speech
patterns, from phones to syllables, words and sentences. A
single finite state network (FSN), composed of acoustic HMM
states, grammar nodes, and their connecting arcs [3], is then
constructed to represent all ASR task constraints, an approach
known as top-down knowledge integration. For a given input
utterance, ASR is performed by searching the FSN via dynamic
programming (DP) based optimal decoding (e.g., [4]) to obtain
the most likely sequence of words as the recognized sentence
using maximum a posteriori (MAP) decoding (e.g., [5], [6]).
We will refer to this type of decoding strategy as integrated
search.
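As a compact recap of the integrated search just described (written in standard notation, not reproduced from this paper), the MAP decoding rule selects the word sequence maximizing the posterior probability:

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W}\; P(W \mid \mathbf{X})
        \;=\; \operatorname*{arg\,max}_{W}\; p(\mathbf{X} \mid W)\, P(W)
```

where $\mathbf{X}$ denotes the sequence of acoustic observations, $p(\mathbf{X} \mid W)$ the acoustic (HMM) likelihood of word sequence $W$, and $P(W)$ the language-model prior; the DP search over the single FSN carries out this maximization jointly under all task constraints.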
This statistical pattern matching approach to ASR relies on
collecting a large amount of speech and text examples and
learning the HMM parameters without the need to use detailed
knowledge about a target language. It offers an advantage for
automatic model learning from data via a rigorous mathemat-
ical formulation. Over almost four decades, we have witnessed
three major HMM technology advances, namely: (i) detailed
modeling – capable of characterizing thousands of context-de-
pendent phone units with millions of parameters using publicly
available software packages (e.g., HTK [7]); (ii) adaptive
modeling – capable of learning an unseen acoustic condition
with a small amount of condition-specific adaptation data (e.g.,
[6], [8]–[10]); and (iii) discriminative modeling – capable of
obtaining HMMs that are discriminative among competing unit
models (e.g., [11]–[15]).
On the other hand, speech researchers would agree that the
ASR problem is still far from solved, given the degraded perfor-
mance of state-of-the-art ASR systems under mismatched training
and testing conditions. Furthermore, poor accuracies are ob-
served when dealing with spontaneous speech, where ill-formed
utterances are usually encountered. It is worth noting that the
word error rate (WER) on the Switchboard task [16] has been
reduced to below 20% only very recently [17], and yet this level
of performance is still rather poor compared with that attained on
LVCSR tasks of similar complexity, e.g., the Wall Street Journal (WSJ)
task [18].
In order to mitigate some of the ASR limitations, we have
seen the utilization of knowledge sources in speech production
(e.g., [19], [20]) and auditory processing and perception (e.g.,
[21]–[23]). Many of them are not easily integrated into the con-
ventional top-down ASR systems. The need for alternative ASR
paradigms that are capable of leveraging the existing speech
literature has thus attracted some research attention in recent
years, and a few significant examples closely related to our work
will be briefly reviewed in Section III. Most of these attempts
1558-7916/$31.00 © 2012 IEEE