IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 5, JULY 2002 315
The Integral Decode: A Smoothing Technique for
Robust HMM-Based Speaker Recognition
Marie Roch, Member, IEEE, and Richard R. Hurtig
Abstract—Recent work by Merhav and Lee, as well as others, has
emphasized that the conditions required to make the maximum a
posteriori (MAP) decision rule an optimal decision rule for speech
recognition do not hold, and has proposed techniques based upon
the adjustment of model parameters to improve speech recognition.
In this article, we consider the problem of text-independent
speaker recognition, and present a new model called the integral
decode. The integral decode, like previous work in this area,
attempts to compensate for the lack of conditions necessary to
ensure optimality of the MAP decision rule in environments with
corrupted observations and imperfect models.
The integral decode is a smoothing operation in the feature-space
domain. A region of uncertainty is established about each noisy
observation, and an approximation of the integral over that region is
computed. The MAP decision rule is then applied to the smoothed
likelihood estimates. In all tested conditions, the integral decode
performs as well as or better than equivalent HMMs without it.
Index Terms—Hidden Markov model (HMM), integral decode,
maximum a posteriori decision rule, speaker recognition, talker
recognition, text-independent speaker identification.
I. INTRODUCTION
THE MAXIMUM a posteriori (MAP) decision rule is optimal
when accurate models
of the distribution are known and the measurements of the
sample data being classified are accurate. In hidden Markov
model (HMM) based speaker recognition, neither is the
case. The measured feature vectors are subject to corruption
from transducer, channel, and quantization effects as well as
environmental noise. Any of these can contribute to error in
both model estimation and testing. In addition to the effects of
measurement error and environmental noise, there are likely to
be inaccuracies in the model due to insufficient (or even absent)
training data for certain phones, transient speaker conditions
such as nasal congestion, and long-term evolution of the
speaker's voice.
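The MAP rule referred to above can be sketched concretely for speaker identification. In this illustrative sketch, the identity-covariance single-Gaussian "speaker models," the 2-D features, and the uniform priors are hypothetical placeholders, not the paper's actual configuration:

```python
import numpy as np

def gauss_logpdf(x, mean):
    """Log-density of an identity-covariance Gaussian (a toy speaker model)."""
    x, mean = np.asarray(x, float), np.asarray(mean, float)
    diff = x - mean
    return -0.5 * np.sum(diff * diff, axis=-1) - len(mean) / 2 * np.log(2 * np.pi)

# Hypothetical speaker models: one mean vector per speaker.
models = {"spk1": [0.0, 0.0], "spk2": [2.0, 1.0], "spk3": [-1.0, 2.0]}
log_prior = np.log(1.0 / len(models))  # uniform priors over speakers

def map_decide(frames):
    """MAP rule: argmax over speakers of sum_t log p(o_t | model) + log P(model)."""
    scores = {name: gauss_logpdf(np.asarray(frames), mean).sum() + log_prior
              for name, mean in models.items()}
    return max(scores, key=scores.get)

utterance = [[2.1, 0.9], [1.8, 1.2], [2.3, 1.0]]  # frames near spk2's mean
print(map_decide(utterance))  # prints "spk2"
```

Under uniform priors, as here, the MAP rule reduces to maximum likelihood; the paper's point is that this argmax is only optimal when both the densities and the observations are trustworthy.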
Consequently, there is a degree of uncertainty in both the
model parameters and the feature vectors of the test utterances.
To illustrate the difficulties associated with such uncertainties,
let us consider a low-dimensional example drawn from the
telephone-quality speech of 26 speakers in the King corpus [1]. The
corpus, as well as the methodology used for feature extraction,
will be described in Section V.

Manuscript received June 5, 2001; revised March 20, 2002. The associate
editor coordinating the review of this manuscript and approving it for
publication was Dr. Philip C. Loizou.
M. Roch was with the Department of Computer Science, University of Iowa,
Iowa City, IA 52242 USA, and the School of Computer Science, Florida
International University, Miami, FL 33199 USA. She is now with the School of
Computer Science, San Diego State University, San Diego, CA 92182-7720 USA
(e-mail: marie.roch@ieee.org).
R. Hurtig is with the Department of Speech Pathology and Audiology,
University of Iowa, Iowa City, IA 52242 USA (e-mail: richard-hurtig@uiowa.edu).
Publisher Item Identifier 10.1109/TSA.2002.800558.
To permit visualization of this problem, only the first two
cepstral components, $[c_1\ c_2]^T$ (where $^T$ denotes the transpose
operator), will be retained as a feature vector. Single-state
semi-continuous HMMs are trained for each speaker following
the methodology outlined in Section V.¹ Five-second utterances
reserved for testing are decoded against each of the 26 speaker
models, and the label associated with the model producing the
highest score (the MAP decision) is selected as the speaker
category. This results in a 0.723 error rate, which is quite poor
but well below the chance error rate of 0.962 (i.e., $1 - 1/26$).
Consider a case of confusion between speakers. Every test
utterance from speaker 14 was misclassified as one of four other
speakers. The most common misclassification was to mislabel
speaker 14 test utterances as having been produced by speaker
24. Fig. 1(a) and (b) show the probability density functions (pdfs)
for speakers 14 and 24.
By subtracting the probability density function of speaker 24
from that of speaker 14, it becomes easy to discern the MAP
decision function with respect to these two speakers for any given
feature vector. Fig. 1(c) plots the function
$d(\mathbf{o}) = p(\mathbf{o} \mid \lambda_{14}) - p(\mathbf{o} \mid \lambda_{24})$.
For any single observation $\mathbf{o}$, a MAP decision function
between speakers 14 and 24 is as follows:

$$\operatorname{class}(\mathbf{o}) = \begin{cases} \text{speaker 14}, & d(\mathbf{o}) > 0 \\ \text{speaker 24}, & \text{otherwise} \end{cases} \qquad (1)$$
hence, only feature vectors lying in positive regions will be
correctly classified. At the base of the figure, contour lines are
plotted in the feature plane to denote the decision boundaries.
Speaker 14's feature vectors for five of the test utterances are
also plotted in this plane. As can be seen, many of the points lie
in negative regions and are misclassified. The large number of
these leads to the overall classification failure for speaker 14.
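The two-speaker decision function discussed above can be illustrated with a toy pair of densities. The Gaussian means and covariances below are hypothetical stand-ins for the speaker 14 and speaker 24 models, chosen only to make the sign of $d(\mathbf{o})$ easy to check:

```python
import numpy as np

def gauss_pdf(x, mean, cov_diag):
    """Diagonal-covariance Gaussian density at point x."""
    x, mean, cov_diag = map(np.asarray, (x, mean, cov_diag))
    quad = np.sum((x - mean) ** 2 / cov_diag)
    norm = np.sqrt((2 * np.pi) ** len(mean) * np.prod(cov_diag))
    return np.exp(-0.5 * quad) / norm

# Stand-ins for the two speaker pdfs of Fig. 1(a) and (b).
p14 = lambda o: gauss_pdf(o, mean=[0.0, 0.0], cov_diag=[1.0, 1.0])
p24 = lambda o: gauss_pdf(o, mean=[1.5, 0.5], cov_diag=[1.0, 1.0])

def d(o):
    """Decision function of Fig. 1(c): positive regions -> speaker 14."""
    return p14(o) - p24(o)

def classify(o):
    """Two-speaker MAP rule of (1)."""
    return "speaker 14" if d(o) > 0 else "speaker 24"

print(classify([0.1, 0.0]))  # near speaker 14's mean -> "speaker 14"
print(classify([1.4, 0.6]))  # near speaker 24's mean -> "speaker 24"
```

Points near the zero contour of `d` are exactly the ones the paper flags as fragile: small measurement error can flip the sign and hence the decision.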
Many of the misclassified points lie in proximity to decision
boundaries. Given the lack of confidence in the distribution and
the measurement, a decision based upon the MAP rule is not
likely to be optimal. If a local neighborhood in either the model
parameter space or the test feature space can be established, a
modified MAP rule which takes into account the uncertainty
may be appropriate.
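The smoothing idea just described, averaging the likelihood over a neighborhood of each noisy observation rather than evaluating it at a single point, can be sketched as follows. The hypercube region, Monte Carlo sampling, and single-Gaussian "model" are illustrative assumptions; the paper's actual region and integration method are developed in later sections:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, var):
    """Isotropic Gaussian density (stand-in for an HMM state likelihood)."""
    x, mean = np.asarray(x), np.asarray(mean)
    d2 = np.sum((x - mean) ** 2)
    return np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (len(mean) / 2)

def smoothed_likelihood(obs, model_pdf, radius=0.25, n_samples=200):
    """Average the likelihood over a region of uncertainty about the
    observation: a Monte Carlo approximation of the integral over a
    hypercube of half-width `radius` centered on obs."""
    obs = np.asarray(obs)
    perturbations = rng.uniform(-radius, radius, size=(n_samples, len(obs)))
    return np.mean([model_pdf(obs + p) for p in perturbations])

model = lambda x: gauss_pdf(x, mean=[0.0, 0.0], var=1.0)
print(smoothed_likelihood([0.2, -0.1], model))
```

The MAP rule is then applied to these smoothed scores instead of the raw point likelihoods, so an observation sitting just across a decision boundary no longer decides the contest on its own.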
The integral decode establishes a region in feature space
about each observation and provides a smoothed likelihood
¹The models created differ from those of Section V only with respect
to the dimensionality of the feature space and in that a smaller
number of mixtures (16) is used.