2240 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 8, OCTOBER 2012
Structural Classification Methods Based on Weighted
Finite-State Transducers for Automatic
Speech Recognition
Yotaro Kubo, Member, IEEE, Shinji Watanabe, Senior Member, IEEE, Takaaki Hori, Member, IEEE, and
Atsushi Nakamura, Senior Member, IEEE
Abstract—The potential of structural classification methods for automatic speech recognition (ASR) has attracted the speech community because such methods can realize unified modeling of the acoustic and linguistic aspects of recognizers. However, structural classification approaches involve a well-known tradeoff between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features considered in calculating the likelihood of each hypothesis must be restricted to the same form as those of conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique remains possible, since most decoding techniques only require that their likelihood functions be factorizable over decoder arcs and time frames. In this paper, we compare two methods for structural classification with WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
Index Terms—Automatic speech recognition (ASR), structural classification, weighted finite-state transducers (WFSTs).
I. INTRODUCTION
ONE-PASS decoding is an important technique for ensuring the real-time property of systems involving automatic speech recognition (ASR) technologies, such as systems for the DARPA TRANSTAC project, meeting recognition systems [1], [2], and closed-captioning systems [3].

Manuscript received January 17, 2012; revised April 19, 2012; accepted April 24, 2012. Date of publication May 11, 2012; date of current version August 09, 2012. This work was supported in part by the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research No. 22300064. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Brian Kingsbury.
Y. Kubo, T. Hori, and A. Nakamura are with the NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan (e-mail: kubo.yotaro@lab.ntt.co.jp; hori.t@lab.ntt.co.jp; nakamura.atsushi@lab.ntt.co.jp).
S. Watanabe was with the NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan. He is now with the Mitsubishi Electric Research Laboratories, Cambridge, MA 02139 USA (e-mail: shinjiw@ieee.org).
Digital Object Identifier 10.1109/TASL.2012.2199112

To enable such applications, the recent development of ASR decoders
has led to fast and accurate speech recognition based on hidden Markov model (HMM) acoustic models and N-gram language models. However, because of the generative formulations of these acoustic and language models, their performance is not directly maximized in terms of error rates. With the aim of directly minimizing the error rates, several discriminative training methods have been proposed that optimize the parameters of the generative models with respect to discriminative criteria [4]–[11]. Because discriminative training methods preserve the form of the original generative models, the decoding techniques developed for conventional ASR models remain available even when the parameters are trained with these methods.
In practice, conventional discriminative training methods train the acoustic and language models separately; in theory, however, the discriminative objective function should involve both the acoustic and language model parameters jointly. Therefore, several joint training methods have recently been proposed. For example, hidden conditional random fields (HCRFs) [12] have been introduced to achieve the joint optimization of parameter vectors that can be translated into HMM and N-gram parameters. Chien [13] proposed a joint training method that directly optimizes these generative model parameters based on the maximum entropy principle. However, these studies still use a model structure identical to that of conventional HMMs and N-gram models. More specifically, these methods still involve independent acoustic and language models in their decoding processes, even though the models are estimated so that the joint performance is maximized.
To overcome these restrictions on model structure, several unified modeling techniques based on structural classification methods have been discussed. Unlike the joint training methods, unified modeling techniques employ expanded model structures to represent and leverage the interdependency of the acoustic and linguistic aspects of ASR. To leverage these interdependencies, most methods based on the structural classification approach estimate the structural labels using features extracted from both the input and the output of the classifiers, where, in terms of ASR, the input represents an acoustic observation sequence and the output represents a linguistic symbol sequence. One of the earliest instances of structural classification is conditional random fields (CRFs) [14], where the interdependency of sequential inputs and outputs is handled directly. Subsequently, structured support vector machines (structured SVMs) were introduced by combining the max-margin property of SVMs with the essence of structural classification.
1558-7916/$31.00 © 2012 IEEE
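As a concrete illustration of the structural classification idea introduced above, the following is a minimal sketch of a structured perceptron over a toy sequence-labeling task. The path score factorizes into independent per-arc, per-frame terms, which is exactly the property the abstract identifies as enabling frame-synchronous one-pass Viterbi decoding. All function names, features, and data here are illustrative assumptions for exposition, not components taken from the paper's systems.

```python
# Hypothetical sketch: structured perceptron with arc/frame-factorized scores.
from itertools import product

LABELS = ["a", "b"]

def arc_features(prev, cur, obs):
    """Features for one decoder arc (prev -> cur) at one observation frame."""
    return {f"emit:{cur}:{obs}": 1.0, f"trans:{prev}:{cur}": 1.0}

def score(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def viterbi(w, xs):
    """Frame-synchronous one-pass search, possible only because the path
    score is a sum of independent per-arc, per-frame terms."""
    delta, back = {"<s>": 0.0}, []
    for obs in xs:
        new, bp = {}, {}
        for prev, cur in product(delta, LABELS):
            s = delta[prev] + score(w, arc_features(prev, cur, obs))
            if cur not in new or s > new[cur]:
                new[cur], bp[cur] = s, prev
        delta, back = new, back + [bp]
    y = max(delta, key=delta.get)       # best final label
    path = [y]
    for bp in reversed(back[1:]):       # follow backpointers to frame 0
        path.append(bp[path[-1]])
    return list(reversed(path))

def path_features(xs, ys):
    """Accumulate arc features along one complete path."""
    total, prev = {}, "<s>"
    for obs, y in zip(xs, ys):
        for k, v in arc_features(prev, y, obs).items():
            total[k] = total.get(k, 0.0) + v
        prev = y
    return total

def perceptron_train(data, epochs=5):
    w = {}
    for _ in range(epochs):
        for xs, ys in data:
            guess = viterbi(w, xs)
            if guess != ys:  # additive update toward the reference path
                for k, v in path_features(xs, ys).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in path_features(xs, guess).items():
                    w[k] = w.get(k, 0.0) - v
    return w

# Toy usage: observation "x" co-occurs with label "a", "y" with "b".
data = [(["x", "x", "y"], ["a", "a", "b"]), (["y", "x"], ["b", "a"])]
w = perceptron_train(data)
print(viterbi(w, ["x", "y"]))  # -> ['a', 'b']
```

In a WFST-based decoder the role of `arc_features` would be played by features attached to transducer arcs, but the update rule and the factorization requirement are unchanged.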