IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 4, NO. 6, DECEMBER 2010 965
Sequential Labeling Using Deep-Structured
Conditional Random Fields
Dong Yu, Senior Member, IEEE, Shizhen Wang, and Li Deng, Fellow, IEEE
Abstract—We develop and present the deep-structured conditional random field (CRF), a multi-layer CRF model in which each higher layer's input observation sequence consists of the previous layer's observation sequence and the resulting frame-level marginal probabilities. Such a structure can closely approximate long-range state dependency using only linear-chain or zeroth-order CRFs by constructing features on the previous layer's output (belief). Although the final layer is trained to maximize the log-likelihood of the state (label) sequence, each lower layer is optimized by maximizing the frame-level marginal probabilities. In this deep-structured CRF, both parameter estimation and state sequence inference are carried out efficiently layer-by-layer from bottom to top. We evaluate the deep-structured CRF on two natural language processing tasks: search query tagging and advertisement field segmentation. The experimental results demonstrate that the deep-structured CRF achieves word labeling accuracies that are significantly higher than the best results reported on these tasks using the same labeled training set.
Index Terms—Conditional random fields (CRFs), deep-structure, marginal probability, natural language processing, sequential labeling, word tagging.
I. INTRODUCTION
CONDITIONAL random fields (CRFs) have been successfully applied to sequential labeling problems, notably
those in natural language processing applications, for several
years [9], [15], [16], [20], [29]. Unlike the hidden Markov
model (HMM), a generative model that describes the joint
probability of the observation data and the class labels, CRFs
are discriminative models that directly estimate the conditional
probability of the class label sequence. In HMMs, observations
in different frames (e.g., word tokens at different positions) are
assumed to be independent given the state. However, CRFs do
not require this assumption and hence have high flexibility in
choosing features, including those that may not exist in some
frames (i.e., word positions in the natural language processing
tasks) and those that depend on the entire observation sequence.
The most popular CRF for sequential labeling is the linear-
chain CRF depicted in Fig. 1 due to its simplicity and efficiency.
Let us denote by $\mathbf{x} = (x_{1}, x_{2}, \dots, x_{T})$ the $T$-frame observation
sequence, and by $\mathbf{y} = (y_{1}, y_{2}, \dots, y_{T})$ the corresponding state
Manuscript received August 17, 2009; accepted February 19, 2010. Date of
publication September 13, 2010; date of current version November 17, 2010.
The associate editor coordinating the review of this manuscript and approving
it for publication was Dr. Xiaodong He.
D. Yu and L. Deng are with Microsoft Research, Redmond, WA 98034 USA
(e-mail: dongyu@microsoft.com; deng@microsoft.com).
S. Wang was with the Department of Electrical Engineering, University of
California, Los Angeles, CA 90095 USA. He is now with Microsoft Corpora-
tion, Redmond, WA 98052 USA (e-mail: shizhen@microsoft.com).
Digital Object Identifier 10.1109/JSTSP.2010.2075990
Fig. 1. Graphical representation of the linear-chain CRF, where $\mathbf{x}$ is the observation sequence and $\mathbf{y} = (y_{1}, \dots, y_{T})$ is the label sequence. The solid and empty nodes denote the observed and unobserved variables, respectively.
(label) sequence, which can be augmented with a special start
and end state.
In the linear-chain CRFs, the conditional probability of a state
(label) sequence $\mathbf{y}$ given the observation sequence $\mathbf{x}$ is given by

$$p(\mathbf{y} \mid \mathbf{x}; \lambda) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{i} \lambda_{i} \sum_{t} f_{i}(y_{t-1}, y_{t}, \mathbf{x}, t) \Big) \qquad (1)$$

where we have used $f_{i}(y_{t-1}, y_{t}, \mathbf{x}, t)$ to represent both the observation features $f_{i}(y_{t}, \mathbf{x}, t)$ that provide constraints between the observation sequence and the state at time $t$, and the state transition features $f_{i}(y_{t-1}, y_{t})$ that provide constraints on the consecutive states. $\lambda = \{\lambda_{i}\}$ are the model parameters, and

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\Big( \sum_{i} \lambda_{i} \sum_{t} f_{i}(y_{t-1}, y_{t}, \mathbf{x}, t) \Big) \qquad (2)$$

is the partition function that normalizes the exponential form so
that it becomes a valid probability measure.
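As a rough illustration (not the paper's implementation), the partition function $Z(\mathbf{x})$ in (2) can be computed in $O(T S^{2})$ time with the forward (sum-product) recursion instead of enumerating all $S^{T}$ label sequences. The toy log-potentials `psi` below are hypothetical, with the per-feature sums already collapsed into one score per (previous state, state, frame), and state 0 at $t=0$ plays the role of the special start state mentioned above:

```python
import math
import random
from itertools import product

# Toy linear-chain CRF with S states and T frames (hypothetical sizes).
# psi[t][i][j] stands in for sum_i lambda_i f_i(y_{t-1}=i, y_t=j, x, t).
S, T = 3, 4
random.seed(0)
psi = [[[random.uniform(-1, 1) for j in range(S)] for i in range(S)]
       for t in range(T)]

def log_Z_brute(psi):
    # Enumerate all S^T label sequences -- exponential, for checking only.
    total = 0.0
    for y in product(range(S), repeat=T):
        score = psi[0][0][y[0]]          # transition out of the start state
        for t in range(1, T):
            score += psi[t][y[t - 1]][y[t]]
        total += math.exp(score)
    return math.log(total)

def log_Z_forward(psi):
    # Forward recursion: alpha[j] = log-sum of scores of all prefixes
    # ending in state j at the current frame.
    alpha = [psi[0][0][j] for j in range(S)]
    for t in range(1, T):
        alpha = [math.log(sum(math.exp(alpha[i] + psi[t][i][j])
                              for i in range(S))) for j in range(S)]
    return math.log(sum(math.exp(a) for a in alpha))

print(abs(log_Z_brute(psi) - log_Z_forward(psi)) < 1e-9)
```

Both routines compute the same $\log Z(\mathbf{x})$; only the forward version scales to realistic sequence lengths.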
The model parameters $\lambda$ in the linear-chain CRFs are typically optimized to maximize the regularized state sequence
log-likelihood

$$J(\lambda) = \sum_{k} \log p\big(\mathbf{y}^{(k)} \mid \mathbf{x}^{(k)}; \lambda\big) - \frac{\|\lambda\|^{2}}{2\sigma^{2}} \qquad (3)$$

where $\sigma$ is a parameter that balances the log-likelihood and the
regularization term and can be tuned using a development set.
The derivatives of $J(\lambda)$ over the model parameters are
given by

$$\frac{\partial J(\lambda)}{\partial \lambda_{i}} = \sum_{k} \sum_{t} f_{i}\big(y_{t-1}^{(k)}, y_{t}^{(k)}, \mathbf{x}^{(k)}, t\big) - \sum_{k} \sum_{\mathbf{y}} p\big(\mathbf{y} \mid \mathbf{x}^{(k)}; \lambda\big) \sum_{t} f_{i}\big(y_{t-1}, y_{t}, \mathbf{x}^{(k)}, t\big) - \frac{\lambda_{i}}{\sigma^{2}} \qquad (4)$$
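To make the relationship between the objective (3) and its derivative (4) concrete, the sketch below brute-forces both on a tiny hypothetical CRF (one training pair, indicator transition features, all sizes illustrative) and confirms that the analytic gradient of (4) matches a finite-difference estimate of (3):

```python
import math
import random
from itertools import product

# Tiny brute-force check of Eqs. (3)-(4); the feature set and sizes
# below are illustrative assumptions, not the paper's setup.
S, T, sigma = 2, 3, 1.0
random.seed(1)
feats = [(i, j) for i in range(S) for j in range(S)]   # transition indicators
lam = [random.uniform(-0.5, 0.5) for _ in feats]
y_obs = (0, 1, 1)            # hypothetical observed labels; start state is 0

def counts(y):
    # Sum_t f_i(y_{t-1}, y_t) for every feature i.
    c = [0.0] * len(feats)
    prev = 0
    for yt in y:
        c[feats.index((prev, yt))] += 1.0
        prev = yt
    return c

def score(y, lam):
    return sum(l * c for l, c in zip(lam, counts(y)))

def objective(lam):
    # Eq. (3): regularized log-likelihood of the single training pair.
    Z = sum(math.exp(score(y, lam)) for y in product(range(S), repeat=T))
    reg = sum(l * l for l in lam) / (2 * sigma ** 2)
    return score(y_obs, lam) - math.log(Z) - reg

def gradient(lam):
    # Eq. (4): empirical counts minus expected counts minus lam/sigma^2.
    Z = sum(math.exp(score(y, lam)) for y in product(range(S), repeat=T))
    exp_c = [0.0] * len(feats)
    for y in product(range(S), repeat=T):
        p = math.exp(score(y, lam)) / Z
        for i, c in enumerate(counts(y)):
            exp_c[i] += p * c
    emp = counts(y_obs)
    return [e - x - l / sigma ** 2 for e, x, l in zip(emp, exp_c, lam)]

# Central finite differences on (3) should reproduce (4).
eps = 1e-6
g = gradient(lam)
for i in range(len(feats)):
    lp, lm = lam[:], lam[:]
    lp[i] += eps
    lm[i] -= eps
    assert abs((objective(lp) - objective(lm)) / (2 * eps) - g[i]) < 1e-5
```

In practice the expectation in (4) is of course not computed by enumeration but with the forward-backward recursions discussed next.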
The parameters in the linear-chain CRFs can be efficiently es-
timated using the forward–backward (sum–product) algorithm
[3] along with optimization algorithms such as generalized
iterative scaling (GIS) [6], gradient ascent, quasi-Newton
1932-4553/$26.00 © 2010 IEEE