2240 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 8, OCTOBER 2012
Structural Classification Methods Based on Weighted
Finite-State Transducers for Automatic
Speech Recognition
Yotaro Kubo, Member, IEEE, Shinji Watanabe, Senior Member, IEEE, Takaaki Hori, Member, IEEE, and
Atsushi Nakamura, Senior Member, IEEE
Abstract—The potential of structural classification methods for automatic speech recognition (ASR) has attracted the speech community because such methods can realize unified modeling of the acoustic and linguistic aspects of recognizers. However, structural classification approaches involve a well-known tradeoff between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features considered in calculating the likelihood of each hypothesis must be restricted to the same form as those of conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique remains possible, since most decoding techniques only require that their likelihood functions be factorizable over decoder arcs and time frames. In this paper, we compare two methods for structural classification with WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
Index Terms—Automatic speech recognition (ASR), structural classification, weighted finite-state transducers (WFSTs).
I. INTRODUCTION
ONE-PASS decoding is an important technique for ensuring the real-time property of systems involving automatic speech recognition (ASR) technologies, such as systems for the DARPA TRANSTAC project, meeting recognition systems [1], [2], and closed-captioning systems [3].

Manuscript received January 17, 2012; revised April 19, 2012; accepted April 24, 2012. Date of publication May 11, 2012; date of current version August 09, 2012. This work was supported in part by the Japan Society for the Promotion of Science under Grant-in-Aid for Scientific Research No. 22300064. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Brian Kingsbury.
Y. Kubo, T. Hori, and A. Nakamura are with the NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan (e-mail: kubo.yotaro@lab.ntt.co.jp; hori.t@lab.ntt.co.jp; nakamura.atsushi@lab.ntt.co.jp).
S. Watanabe was with the NTT Communication Science Laboratories, NTT Corporation, Kyoto 619-0237, Japan. He is now with the Mitsubishi Electric Research Laboratories, Cambridge, MA 02139 USA (e-mail: shinjiw@ieee.org).
Digital Object Identifier 10.1109/TASL.2012.2199112

To enable such applications, the recent development of ASR decoders
has led to fast and accurate speech recognition based on hidden Markov model (HMM) acoustic models and N-gram language models. However, because of the generative formulations of these acoustic and language models, their performance is not directly maximized in terms of error rates. With the aim of directly minimizing the error rates, several discriminative training methods have been proposed that optimize the parameters of the generative models with respect to discriminative criteria [4]–[11]. Because discriminative training methods preserve the form of the original generative models, the decoding techniques developed for conventional ASR models remain available even when the parameters are trained with these methods.
In practice, conventional discriminative training methods train the acoustic and language models separately; in theory, however, the discriminative objective function should involve both the acoustic and language model parameters jointly. Therefore, several joint training methods have recently been proposed. For example, hidden conditional random fields (HCRFs) [12] have been introduced to achieve the joint optimization of parameter vectors that can be translated into HMM and N-gram parameters. Chien [13] proposed a joint training method that directly optimizes these generative model parameters based on the maximum entropy principle. However, these studies still use a model structure identical to that of conventional HMMs and N-gram models. More specifically, these methods still involve independent acoustic and language models in their decoding processes, even though the models are estimated so that the joint performance is maximized.
To overcome these restrictions on model structure, several unified modeling techniques based on structural classification methods have been discussed. Unlike the joint training methods, unified modeling techniques employ expanded model structures to represent and leverage the interdependency of the acoustic and linguistic aspects of ASR. To leverage these interdependencies, most methods based on the structural classification approach estimate the structural labels using features extracted from both the input and the output of the classifiers, where, in terms of ASR, the input represents an acoustic observation sequence and the output represents a linguistic symbol sequence. One of the earliest instances of structural classification is conditional random fields (CRFs) [14], where the interdependency of sequential inputs and outputs is handled directly. Subsequently, structured support vector machines (structured SVMs) were introduced by combining the max-margin property of SVMs with the essence of structural classification.
1558-7916/$31.00 © 2012 IEEE
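As a concrete illustration of the structural classification idea introduced above, the following is a minimal sketch of a structured perceptron over a toy sequence-labeling task. The path score factorizes into independent per-arc, per-frame terms, which is exactly the property the abstract identifies as enabling frame-synchronous one-pass Viterbi decoding. All function names, features, and data here are illustrative assumptions for exposition, not components taken from the paper's systems.

```python
# Hypothetical sketch: structured perceptron with arc/frame-factorized scores.
from itertools import product

LABELS = ["a", "b"]

def arc_features(prev, cur, obs):
    """Features for one decoder arc (prev -> cur) at one observation frame."""
    return {f"emit:{cur}:{obs}": 1.0, f"trans:{prev}:{cur}": 1.0}

def score(w, feats):
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def viterbi(w, xs):
    """Frame-synchronous one-pass search, possible only because the path
    score is a sum of independent per-arc, per-frame terms."""
    delta, back = {"<s>": 0.0}, []
    for obs in xs:
        new, bp = {}, {}
        for prev, cur in product(delta, LABELS):
            s = delta[prev] + score(w, arc_features(prev, cur, obs))
            if cur not in new or s > new[cur]:
                new[cur], bp[cur] = s, prev
        delta, back = new, back + [bp]
    y = max(delta, key=delta.get)       # best final label
    path = [y]
    for bp in reversed(back[1:]):       # follow backpointers to frame 0
        path.append(bp[path[-1]])
    return list(reversed(path))

def path_features(xs, ys):
    """Accumulate arc features along one complete path."""
    total, prev = {}, "<s>"
    for obs, y in zip(xs, ys):
        for k, v in arc_features(prev, y, obs).items():
            total[k] = total.get(k, 0.0) + v
        prev = y
    return total

def perceptron_train(data, epochs=5):
    w = {}
    for _ in range(epochs):
        for xs, ys in data:
            guess = viterbi(w, xs)
            if guess != ys:  # additive update toward the reference path
                for k, v in path_features(xs, ys).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in path_features(xs, guess).items():
                    w[k] = w.get(k, 0.0) - v
    return w

# Toy usage: observation "x" co-occurs with label "a", "y" with "b".
data = [(["x", "x", "y"], ["a", "a", "b"]), (["y", "x"], ["b", "a"])]
w = perceptron_train(data)
print(viterbi(w, ["x", "y"]))  # -> ['a', 'b']
```

In a WFST-based decoder the role of `arc_features` would be played by features attached to transducer arcs, but the update rule and the factorization requirement are unchanged.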