IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 10, NO. 5, JULY 2002 315
The Integral Decode: A Smoothing Technique for
Robust HMM-Based Speaker Recognition
Marie Roch, Member, IEEE, and Richard R. Hurtig
Abstract—Recent work by Merhav and Lee, as well as others, has
emphasized that the conditions required to make the maximum a
posteriori (MAP) decision rule an optimal decision rule for speech
recognition do not hold, and has proposed techniques based upon
the adjustment of model parameters to improve speech recognition.
In this article, we consider the problem of text-independent
speaker recognition, and present a new model called the integral
decode. The integral decode, like previous work in this area,
attempts to compensate for the lack of conditions necessary to
ensure optimality of the MAP decision rule in environments with
corrupted observations and imperfect models.
The integral decode is a smoothing operation in the feature-space
domain. A region of uncertainty is established about each noisy
observation, and an approximation of the integral over that region is
computed. The MAP decision rule is then applied to the smoothed
likelihood estimates. In all tested conditions, the integral decode
performs as well as or better than equivalent HMMs without it.
Index Terms—Hidden Markov model (HMM), integral decode,
maximum a posteriori decision rule, speaker recognition, talker
recognition, text-independent speaker identification.
I. INTRODUCTION
THE MAXIMUM a posteriori (MAP) decision rule is optimal
when accurate models
of the distribution are known and the measurements of the
sample data being classified are accurate. In hidden Markov
model (HMM) based speaker recognition, neither is the
case. The measured feature vectors are subject to corruption
from transducer, channel, and quantization effects as well as
environmental noise. Any of these can contribute to error in
both model estimation and testing. In addition to the effects of
measurement error and environmental noise, there are likely to
be inaccuracies in the model due to insufficient (or even absent)
training data for certain phones, transient speaker conditions
such as nasal congestion, and long-term evolution of the
speaker's voice.
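The MAP rule referred to above can be sketched concretely for speaker identification. In this illustrative sketch, the identity-covariance single-Gaussian "speaker models," the 2-D features, and the uniform priors are hypothetical placeholders, not the paper's actual configuration:

```python
import numpy as np

def gauss_logpdf(x, mean):
    """Log-density of an identity-covariance Gaussian (a toy speaker model)."""
    x, mean = np.asarray(x, float), np.asarray(mean, float)
    diff = x - mean
    return -0.5 * np.sum(diff * diff, axis=-1) - len(mean) / 2 * np.log(2 * np.pi)

# Hypothetical speaker models: one mean vector per speaker.
models = {"spk1": [0.0, 0.0], "spk2": [2.0, 1.0], "spk3": [-1.0, 2.0]}
log_prior = np.log(1.0 / len(models))  # uniform priors over speakers

def map_decide(frames):
    """MAP rule: argmax over speakers of sum_t log p(o_t | model) + log P(model)."""
    scores = {name: gauss_logpdf(np.asarray(frames), mean).sum() + log_prior
              for name, mean in models.items()}
    return max(scores, key=scores.get)

utterance = [[2.1, 0.9], [1.8, 1.2], [2.3, 1.0]]  # frames near spk2's mean
print(map_decide(utterance))  # prints "spk2"
```

Under uniform priors, as here, the MAP rule reduces to maximum likelihood; the paper's point is that this argmax is only optimal when both the densities and the observations are trustworthy.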
Consequently, there is a degree of uncertainty in both the
model parameters and the feature vectors of the test utterances.
To illustrate the difficulties associated with such uncertainties,
let us consider a low-dimensional example drawn from the
telephone-quality speech of 26 speakers in the King corpus [1]. The
corpus, as well as the methodology used for feature extraction,
will be described in Section V.

Manuscript received June 5, 2001; revised March 20, 2002. The associate
editor coordinating the review of this manuscript and approving it for
publication was Dr. Philip C. Loizou.
M. Roch was with the Department of Computer Science, University of Iowa,
Iowa City, IA 52242 USA, and the School of Computer Science, Florida
International University, Miami, FL 33199 USA. She is now with the School of
Computer Science, San Diego State University, San Diego, CA 92182-7720 USA
(e-mail: marie.roch@ieee.org).
R. Hurtig is with the Department of Speech Pathology and Audiology,
University of Iowa, Iowa City, IA 52242 USA (e-mail: richard-hurtig@uiowa.edu).
Publisher Item Identifier 10.1109/TSA.2002.800558.
To permit visualization of this problem, only the first two
cepstral components, $[c_1\ c_2]^T$ (where $^T$ denotes the transpose
operator), will be retained as a feature vector. Single-state
semi-continuous HMMs are trained for each speaker following
the methodology outlined in Section V.¹ Five-second utterances
reserved for testing are decoded against each of the 26 speaker
models, and the label associated with the model producing the
highest score (the MAP decision) is selected as the speaker
category. This results in a 0.723 error rate, which is quite poor
but well below the chance error rate of 0.962 (i.e., $1 - 1/26$).
Consider a case of confusion between speakers. Every test
utterance from speaker 14 was misclassified as one of four other
speakers. The most common misclassification was to mislabel
speaker 14 test utterances as having been produced by speaker
24. Fig. 1(a) and (b) show the probability density functions (pdfs)
for speakers 14 and 24.
By subtracting the probability density function of speaker 24
from that of speaker 14, it becomes easy to discern the MAP
decision function with respect to these two speakers for any given
feature vector. Fig. 1(c) plots the function
$d(\mathbf{o}) = p(\mathbf{o} \mid \lambda_{14}) - p(\mathbf{o} \mid \lambda_{24})$.
For any single observation $\mathbf{o}$, a MAP decision function
between speakers 14 and 24 is as follows:

$$\operatorname{class}(\mathbf{o}) = \begin{cases} \text{speaker 14}, & d(\mathbf{o}) > 0 \\ \text{speaker 24}, & \text{otherwise} \end{cases} \qquad (1)$$
hence, only feature vectors lying in positive regions will be
correctly classified. At the base of the figure, contour lines are
plotted in the feature plane to denote the decision boundaries.
Speaker 14's feature vectors for five of the test utterances are
also plotted in this plane. As can be seen, many of the points lie
in negative regions and are misclassified. The large number of
these leads to the overall classification failure for speaker 14.
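The two-speaker decision function discussed above can be illustrated with a toy pair of densities. The Gaussian means and covariances below are hypothetical stand-ins for the speaker 14 and speaker 24 models, chosen only to make the sign of $d(\mathbf{o})$ easy to check:

```python
import numpy as np

def gauss_pdf(x, mean, cov_diag):
    """Diagonal-covariance Gaussian density at point x."""
    x, mean, cov_diag = map(np.asarray, (x, mean, cov_diag))
    quad = np.sum((x - mean) ** 2 / cov_diag)
    norm = np.sqrt((2 * np.pi) ** len(mean) * np.prod(cov_diag))
    return np.exp(-0.5 * quad) / norm

# Stand-ins for the two speaker pdfs of Fig. 1(a) and (b).
p14 = lambda o: gauss_pdf(o, mean=[0.0, 0.0], cov_diag=[1.0, 1.0])
p24 = lambda o: gauss_pdf(o, mean=[1.5, 0.5], cov_diag=[1.0, 1.0])

def d(o):
    """Decision function of Fig. 1(c): positive regions -> speaker 14."""
    return p14(o) - p24(o)

def classify(o):
    """Two-speaker MAP rule of (1)."""
    return "speaker 14" if d(o) > 0 else "speaker 24"

print(classify([0.1, 0.0]))  # near speaker 14's mean -> "speaker 14"
print(classify([1.4, 0.6]))  # near speaker 24's mean -> "speaker 24"
```

Points near the zero contour of `d` are exactly the ones the paper flags as fragile: small measurement error can flip the sign and hence the decision.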
Many of the misclassified points lie in proximity to decision
boundaries. Given the lack of confidence in the distribution and
the measurement, a decision based upon the MAP rule is not
likely to be optimal. If a local neighborhood in either the model
parameter space or the test feature space can be established, a
modified MAP rule which takes into account the uncertainty
may be appropriate.
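The smoothing idea just described, averaging the likelihood over a neighborhood of each noisy observation rather than evaluating it at a single point, can be sketched as follows. The hypercube region, Monte Carlo sampling, and single-Gaussian "model" are illustrative assumptions; the paper's actual region and integration method are developed in later sections:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mean, var):
    """Isotropic Gaussian density (stand-in for an HMM state likelihood)."""
    x, mean = np.asarray(x), np.asarray(mean)
    d2 = np.sum((x - mean) ** 2)
    return np.exp(-0.5 * d2 / var) / (2 * np.pi * var) ** (len(mean) / 2)

def smoothed_likelihood(obs, model_pdf, radius=0.25, n_samples=200):
    """Average the likelihood over a region of uncertainty about the
    observation: a Monte Carlo approximation of the integral over a
    hypercube of half-width `radius` centered on obs."""
    obs = np.asarray(obs)
    perturbations = rng.uniform(-radius, radius, size=(n_samples, len(obs)))
    return np.mean([model_pdf(obs + p) for p in perturbations])

model = lambda x: gauss_pdf(x, mean=[0.0, 0.0], var=1.0)
print(smoothed_likelihood([0.2, -0.1], model))
```

The MAP rule is then applied to these smoothed scores instead of the raw point likelihoods, so an observation sitting just across a decision boundary no longer decides the contest on its own.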
The integral decode establishes a region in feature space
about each observation and provides a smoothed likelihood
¹The models created differ from those of Section V only with respect
to the dimensionality of the feature space and in that a smaller
number of mixtures (16) is used.