IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 4, NO. 6, DECEMBER 2010, p. 965

Sequential Labeling Using Deep-Structured Conditional Random Fields

Dong Yu, Senior Member, IEEE, Shizhen Wang, and Li Deng, Fellow, IEEE

Abstract—We develop and present the deep-structured conditional random field (CRF), a multi-layer CRF model in which each higher layer's input observation sequence consists of the previous layer's observation sequence and the resulting frame-level marginal probabilities. Such a structure can closely approximate long-range state dependencies using only linear-chain or zeroth-order CRFs by constructing features on the previous layer's output (belief). Although the final layer is trained to maximize the log-likelihood of the state (label) sequence, each lower layer is optimized by maximizing the frame-level marginal probabilities. In this deep-structured CRF, both parameter estimation and state-sequence inference are carried out efficiently layer by layer, from bottom to top. We evaluate the deep-structured CRF on two natural language processing tasks: search query tagging and advertisement field segmentation. The experimental results demonstrate that the deep-structured CRF achieves word-labeling accuracies significantly higher than the best results reported on these tasks using the same labeled training set.

Index Terms—Conditional random fields (CRFs), deep structure, marginal probability, natural language processing, sequential labeling, word tagging.

I. INTRODUCTION

Conditional random fields (CRFs) have been successfully applied to sequential labeling problems, notably those in natural language processing applications, for several years [9], [15], [16], [20], [29]. Unlike the hidden Markov model (HMM), a generative model that describes the joint probability of the observation data and the class labels, CRFs are discriminative models that estimate the conditional probabilities of the class label sequence directly.
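The bottom-up data flow described in the abstract can be sketched as follows. This is a minimal illustration only: `train_crf` and `frame_marginals` are hypothetical placeholders standing in for any CRF toolkit's training and per-frame marginal-inference routines, not the paper's implementation.

```python
def train_deep_crf(x_seqs, y_seqs, num_layers, train_crf, frame_marginals):
    """Bottom-up training sketch: each higher layer's input frames are the
    previous layer's frames concatenated with that layer's frame-level
    marginal probabilities (its 'belief')."""
    layers = []
    inputs = x_seqs  # layer-1 input: the raw observation sequences
    for _ in range(num_layers):
        crf = train_crf(inputs, y_seqs)  # lower layers would use the
        layers.append(crf)               # frame-marginal objective here
        inputs = [
            [frame + tuple(frame_marginals(crf, seq)[t])
             for t, frame in enumerate(seq)]
            for seq in inputs
        ]
    return layers
```

Inference follows the same layer-by-layer order: each trained layer's marginals are appended to the frames before the next layer is applied.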
In HMMs, observations in different frames (e.g., word tokens at different positions) are assumed to be independent given the state. CRFs do not require this assumption and hence offer greater flexibility in choosing features, including features that may not exist in some frames (i.e., word positions in the natural language processing tasks) and features that depend on the entire observation sequence.

Manuscript received August 17, 2009; accepted February 19, 2010. Date of publication September 13, 2010; date of current version November 17, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xiaodong He. D. Yu and L. Deng are with Microsoft Research, Redmond, WA 98034 USA (e-mail: dongyu@microsoft.com; deng@microsoft.com). S. Wang was with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095 USA. He is now with Microsoft Corporation, Redmond, WA 98052 USA (e-mail: shizhen@microsoft.com). Digital Object Identifier 10.1109/JSTSP.2010.2075990

The most popular CRF for sequential labeling is the linear-chain CRF depicted in Fig. 1, owing to its simplicity and efficiency.

Fig. 1. Graphical representation of the linear-chain CRF, where $\mathbf{x}$ is the observation sequence and $\mathbf{y}$ is the label sequence. The solid and empty nodes denote the observed and unobserved variables, respectively.

Let us denote by $\mathbf{x} = (x_1, \ldots, x_T)$ the $T$-frame observation sequence and by $\mathbf{y} = (y_1, \ldots, y_T)$ the corresponding state (label) sequence, which can be augmented with a special start and end state. In the linear-chain CRF, the conditional probability of a state (label) sequence $\mathbf{y}$ given the observation sequence $\mathbf{x}$ is

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{i} \lambda_i f_i(y_{t-1}, y_t, \mathbf{x}, t) \right) \quad (1)$$

where we have used $f_i(y_{t-1}, y_t, \mathbf{x}, t)$ to represent both the observation features $f_i(y_t, \mathbf{x}, t)$, which provide constraints between the observation sequence and the state at time $t$, and the state transition features $f_i(y_{t-1}, y_t)$, which provide constraints on the consecutive states.
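To make the linear-chain conditional probability of Eq. (1) concrete, here is a minimal sketch that evaluates it by brute-force enumeration on a toy problem. The two-state inventory, the dictionary-of-weights feature parameterization, and the weight values are illustrative assumptions, not the paper's feature set.

```python
import itertools
import math

STATES = (0, 1)  # toy two-state label inventory (an assumption for illustration)

def seq_score(y, x, w_obs, w_trans):
    """Unnormalized log-score of label sequence y for observations x:
    the inner double sum of Eq. (1), split into observation features
    (state, observation) and transition features (prev state, state)."""
    s = sum(w_obs[(y[t], x[t])] for t in range(len(x)))
    s += sum(w_trans[(y[t - 1], y[t])] for t in range(1, len(x)))
    return s

def crf_prob(y, x, w_obs, w_trans):
    """p(y | x) of Eq. (1), with the partition function Z(x) computed
    by enumerating every possible label sequence (exponential; toy only)."""
    z = sum(math.exp(seq_score(yp, x, w_obs, w_trans))
            for yp in itertools.product(STATES, repeat=len(x)))
    return math.exp(seq_score(y, x, w_obs, w_trans)) / z
```

Because the scores are normalized by the same partition function, the probabilities of all label sequences for a fixed observation sequence sum to one, which is exactly the role of $Z(\mathbf{x})$.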
Here $\boldsymbol{\lambda} = \{\lambda_i\}$ are the model parameters, and

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \sum_{i} \lambda_i f_i(y_{t-1}, y_t, \mathbf{x}, t) \right) \quad (2)$$

is the partition function that normalizes the exponential form so that it becomes a valid probability measure.

The model parameters in the linear-chain CRF are typically optimized to maximize the regularized state-sequence log-likelihood

$$J(\boldsymbol{\lambda}) = \sum_{n} \log p\big(\mathbf{y}^{(n)} \mid \mathbf{x}^{(n)}; \boldsymbol{\lambda}\big) - \frac{\|\boldsymbol{\lambda}\|^2}{2\sigma^2} \quad (3)$$

where $\sigma$ is a parameter that balances the log-likelihood and the regularization term and can be tuned using a development set. The derivatives of $J(\boldsymbol{\lambda})$ with respect to the model parameters are given by

$$\frac{\partial J}{\partial \lambda_i} = \sum_{n} \sum_{t} \left[ f_i\big(y_{t-1}^{(n)}, y_t^{(n)}, \mathbf{x}^{(n)}, t\big) - \sum_{y_{t-1}, y_t} p\big(y_{t-1}, y_t \mid \mathbf{x}^{(n)}\big) f_i\big(y_{t-1}, y_t, \mathbf{x}^{(n)}, t\big) \right] - \frac{\lambda_i}{\sigma^2} \quad (4)$$

The parameters in the linear-chain CRF can be efficiently estimated using the forward–backward (sum–product) algorithm [3] together with optimization algorithms such as generalized iterative scaling (GIS) [6], gradient ascent, or quasi-Newton methods.
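The efficiency of the forward (sum-product) recursion over the naive enumeration in Eq. (2) can be illustrated with a small sketch: both routines compute $\log Z(\mathbf{x})$, but the forward pass costs $O(T|S|^2)$ instead of $O(|S|^T)$. The two-state inventory and dictionary-of-weights features are toy assumptions carried over for illustration.

```python
import itertools
import math

STATES = (0, 1)  # toy two-state inventory (an assumption for illustration)

def log_z_forward(x, w_obs, w_trans):
    """log Z(x) via the forward (sum-product) recursion: alpha[s] holds the
    summed exp-scores of all label prefixes ending in state STATES[s]."""
    alpha = [math.exp(w_obs[(s, x[0])]) for s in STATES]
    for t in range(1, len(x)):
        alpha = [sum(alpha[sp] * math.exp(w_trans[(STATES[sp], s)] + w_obs[(s, x[t])])
                     for sp in range(len(STATES)))
                 for s in STATES]
    return math.log(sum(alpha))

def log_z_bruteforce(x, w_obs, w_trans):
    """log Z(x) by direct enumeration of Eq. (2); for cross-checking only."""
    total = 0.0
    for y in itertools.product(STATES, repeat=len(x)):
        s = sum(w_obs[(y[t], x[t])] for t in range(len(x)))
        s += sum(w_trans[(y[t - 1], y[t])] for t in range(1, len(x)))
        total += math.exp(s)
    return math.log(total)
```

In practice the recursion is carried out in the log domain (or with per-frame rescaling) to avoid overflow for long sequences; the plain products above are adequate only for toy sizes.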