Discriminative Adaptation for Speaker Verification

C. Longworth and M. J. F. Gales
Engineering Department, Cambridge University
Trumpington St, Cambridge, CB2 1PZ
{cl336,mjfg}@eng.cam.ac.uk

Abstract

Speaker verification is a binary classification task: determining whether a claimed speaker uttered a phrase. Current approaches typically adapt a general speaker model, the Universal Background Model (UBM), normally a Gaussian Mixture Model (GMM), to model a particular speaker. Verification is then performed by comparing the likelihood from the speaker model with that from the UBM. Maximum A-Posteriori (MAP) estimation is commonly used to adapt the UBM to a particular speaker. However, speaker verification is a classification task. Thus, robust discriminative adaptation schemes should yield gains over the standard MAP approach. This paper describes and evaluates two discriminative approaches to speaker verification. The first is a discriminative version of MAP based on Maximum Mutual Information (MMI-MAP). The second uses an augmented GMM (A-GMM) as the speaker-specific model. The additional, augmented, parameters are discriminatively, and robustly, trained using a maximum margin estimation approach. The performance of these models is evaluated on the NIST 2002 SRE dataset. Though no gains were obtained using MMI-MAP, the A-GMM system gave an Equal Error Rate (EER) of 7.31%, a 30% relative reduction in EER compared to the best performing GMM system.

1. Introduction

Gaussian mixture models (GMMs) have become the dominant approach for modelling acoustic features in text-independent speaker verification systems [1]. The standard approach is to train a GMM on all the available speaker data and use this as a Universal Background Model (UBM) to represent all speakers.
This UBM is then adapted to the limited amount of enrolment data for a particular speaker. Maximum A-Posteriori (MAP) adaptation is the usual approach, as it allows the large number of GMM components in the UBM to be robustly adapted to a speaker. However, speaker verification is inherently a classification task. Hence discriminative approaches to robustly adapting the general UBM to the specific speaker have the potential to yield gains over the standard MAP approach. Most previous discriminative approaches have concentrated on the use of Support Vector Machines (SVMs) with kernels that handle the dynamic nature of the speaker verification task. Example kernels include generative kernels [2], the Kullback-Leibler kernel [3] and the sequence kernel [4]. All these approaches generate decision boundaries in a score-space rather than discriminatively adapting the speaker models. This paper describes and evaluates two different discriminative approaches for speaker adaptation.

(Acknowledgement: C. Longworth would like to thank the Schiff Foundation for funding.)

The first discriminative adaptation approach is based on a discriminative MAP scheme which has been found to work well for both task and gender adaptation in automatic speech recognition (ASR) [5]. Robust MAP estimates are obtained using the Maximum Mutual Information (MMI) criterion, rather than Maximum Likelihood (ML). This approach can be viewed as maximising the posterior of the correct speaker compared to all other speakers.

The second approach uses an augmented GMM (A-GMM) as the speaker-specific model. Here the standard MAP-adapted speaker model is augmented by a local exponential approximation. The parameters of this augmentation, the augmented parameters, are estimated using maximum margin training, a discriminative approach [6]. Maximum margin estimation schemes should yield robust estimates even on limited data. This form of model is closely related to the verification work in [2].
However, the approach here is from an adaptation perspective, rather than using an SVM to generate a decision boundary in a generative score-space. This has the advantage that the posterior can be computed for any observation, allowing simple combination with other statistical approaches if desired [6]. Such probabilistic interpretations are not normally possible with SVMs.

This paper is structured as follows. The next section briefly describes statistical approaches to speaker verification and the two discriminative approaches investigated in this paper. In Section 3, experimental results on the 2002 NIST speaker recognition evaluation dataset are presented. Finally, conclusions are drawn.

2. Discriminative Adaptation

The standard approaches used for speaker verification are based on Bayes' decision rule. To decide whether speaker $s$ uttered the observation sequence $O$, the following decision rule is applied:

$$
\log P(\omega_s \mid O; \lambda) \;\underset{\text{reject}}{\overset{\text{accept}}{\gtrless}}\; \beta \qquad (1)
$$

where $\beta$ is a threshold that sets the balance between false accepts and false rejects, and $\lambda$ are the model parameters for all $S$ speakers. As generative models (GMMs) are usually used, Bayes' rule can be applied to obtain the posterior of the class given models for all speakers. However, rather than using a combined speaker model in the denominator, which assumes a closed set, a UBM is usually trained on all the speakers and used instead. This simple approximation is faster and yields performance gains. The UBM is also used as the prior distribution for the MAP estimates of the speaker-specific parameters, $\lambda^{(s)}$ [1]. This section describes how discriminative adaptation schemes may be used to obtain the speaker-specific models.
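
As an illustration of the standard GMM-UBM baseline described above, the following Python sketch adapts the UBM means to a speaker's enrolment frames with MAP and then scores a test utterance by the average per-frame log-likelihood ratio between the speaker model and the UBM, which is compared against the threshold β. It is a minimal sketch under stated assumptions, not the system evaluated in this paper: the function names (`log_gmm`, `map_adapt_means`, `llr_score`, `verify`), the restriction to mean-only adaptation with diagonal covariances, the relevance factor `tau = 16`, and the per-frame normalisation of the score are all illustrative choices.

```python
import math

def log_gmm(x, weights, means, variances):
    """Return (log p(x), component posteriors) for a diagonal-covariance GMM."""
    log_joint = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, md, vd in zip(x, mu, var):
            ll -= 0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
        log_joint.append(ll)
    peak = max(log_joint)  # log-sum-exp for numerical stability
    log_px = peak + math.log(sum(math.exp(l - peak) for l in log_joint))
    return log_px, [math.exp(l - log_px) for l in log_joint]

def map_adapt_means(frames, weights, means, variances, tau=16.0):
    """Mean-only MAP adaptation with the UBM as prior.

    tau is an assumed relevance factor controlling how strongly the
    adapted means are pulled back towards the UBM means.
    """
    dim = len(means[0])
    occ = [0.0] * len(means)                # soft counts per component
    first = [[0.0] * dim for _ in means]    # posterior-weighted sums of frames
    for x in frames:
        _, post = log_gmm(x, weights, means, variances)
        for m, g in enumerate(post):
            occ[m] += g
            for d in range(dim):
                first[m][d] += g * x[d]
    return [[(tau * means[m][d] + first[m][d]) / (tau + occ[m])
             for d in range(dim)]
            for m in range(len(means))]

def llr_score(frames, weights, spk_means, ubm_means, variances):
    """Average per-frame log-likelihood ratio: speaker model versus UBM."""
    total = 0.0
    for x in frames:
        total += (log_gmm(x, weights, spk_means, variances)[0]
                  - log_gmm(x, weights, ubm_means, variances)[0])
    return total / len(frames)

def verify(frames, weights, spk_means, ubm_means, variances, beta=0.0):
    """Accept the claimed identity when the score exceeds the threshold beta."""
    return llr_score(frames, weights, spk_means, ubm_means, variances) > beta
```

With `beta = 0` the rule accepts whenever the adapted model explains the test frames better than the UBM does; in practice the threshold would be tuned on held-out data to set the desired balance of false accepts and false rejects.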