COMPARING MAXIMUM A POSTERIORI VECTOR QUANTIZATION AND GAUSSIAN MIXTURE MODELS IN SPEAKER VERIFICATION *

Tomi Kinnunen, Juhani Saastamoinen, Ville Hautamäki, Mikko Vinni, Pasi Fränti

Speech and Image Processing Unit (SIPU), Dept. of Computer Science and Statistics
University of Joensuu, P.O. Box 111, FI-80101 Joensuu, FINLAND
E-mail: {tkinnu,juhani,villeh,mvinni,franti}@cs.joensuu.fi

ABSTRACT

The Gaussian mixture model - universal background model (GMM-UBM) is a standard reference classifier in speaker verification. We have recently proposed a simplified model using vector quantization (VQ-UBM). In this study, we extensively compare these two classifiers on the NIST 2005, 2006 and 2008 SRE corpora, with a standard discriminative classifier (GLDS-SVM) as a reference point. We focus on the parameter setting for N-top scoring, model order, and performance for different amounts of training data. The most interesting result, contrary to a common belief, is that GMM-UBM yields better results for short segments whereas VQ-UBM is good for long utterances. The results also suggest that maximum likelihood training of the UBM is sub-optimal, and hence, alternative ways to train the UBM should be considered.

Index Terms— Speaker verification, MFCCs, Gaussian mixture model (GMM), vector quantization (VQ), MAP training

1. INTRODUCTION

Typical speaker verification systems use mel-frequency cepstral coefficients (MFCCs) to parameterize the speech signal. Feature extraction is followed by speaker modeling, for which two approaches have been dominant in the 21st century: generative modeling based on maximum a posteriori (MAP) adaptation of a speaker-independent universal background model (UBM) [1, 2], and discriminative modeling based on the concept of a separating hyperplane [3, 4]. The latest solutions also use a so-called eigenchannel transformation and joint factor analysis (JFA) to reduce the effects of channel and session variability in the speaker models [5].
We use MFCCs and focus on the speaker modeling by Gaussian mixture model with UBM (GMM-UBM) [1], vector quantizer with UBM (VQ-UBM) [2] and generalized linear discriminant sequence support vector machine (GLDS-SVM) [3]. We set the following limitations in order to keep the baseline simple: (1) we use only telephone data for background modeling, (2) we do not use any inter-session variability compensation, (3) we do not make use of an ASR component, (4) we do not make use of language information, (5) we do not use additional score normalization such as T-norm [6]. More complete systems used in recent NIST speaker recognition evaluations use such techniques in conjunction with each other. Our simplifications allow us to focus more deeply on the modeling component, but on the other hand, they weaken the overall performance in comparison to more complete systems, especially for non-telephony data.

* AN EXTENDED VERSION OF THIS PAPER HAS BEEN ACCEPTED FOR PUBLICATION IN PATTERN RECOGNITION LETTERS.

Vector quantization speaker modeling was popular in the 1980s and 1990s [7, 8], but since the introduction of the background model concept for GMMs [1], the GMM has been the dominant approach. Even so, usually only the mean vectors of the GMM are adapted, while shared (co)variances and weights are used for all speakers. This raises the question of whether the variances and weights are needed at all. To answer this question, we derived a MAP adaptation algorithm for the VQ model [2] as a special case of the MAP adaptation for the GMM, involving only the centroid vectors. The VQ approach achieves a speed-up in training compared to the GMM, with comparable accuracy. In this paper, we further explore the inherent differences of the GMM-UBM and VQ-UBM classifiers in the speaker verification task, with the GLDS-SVM classifier as a reference point. The results presented here are based on our submissions to the NIST 2006 and NIST 2008 speaker recognition evaluations.
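To illustrate the idea of adapting only the centroid vectors, the following sketch applies mean-only MAP adaptation with hard nearest-centroid assignments (the VQ special case). This is an illustrative example under our own assumptions: the function name, interface, and relevance-factor value are hypothetical and do not reproduce the exact implementation of [2].

```python
import numpy as np

def map_adapt_centroids(ubm_means, X, relevance=16.0):
    """Mean-only MAP adaptation with hard (VQ) assignments.

    ubm_means : (K, D) array of UBM centroid vectors
    X         : (T, D) array of training feature vectors
    relevance : MAP relevance factor r (value chosen for illustration)
    """
    # Hard assignment: index of the nearest centroid for each frame
    d2 = ((X[:, None, :] - ubm_means[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)

    adapted = ubm_means.copy()
    for k in range(len(ubm_means)):
        frames = X[idx == k]
        n_k = len(frames)
        if n_k == 0:
            continue  # centroids with no assigned frames keep the UBM value
        e_k = frames.mean(axis=0)         # sufficient statistic (mean of assigned frames)
        alpha = n_k / (n_k + relevance)   # data-dependent adaptation coefficient
        # Interpolate between the data statistic and the UBM prior
        adapted[k] = alpha * e_k + (1 - alpha) * ubm_means[k]
    return adapted
```

Centroids with many assigned frames move close to the data mean (alpha near 1), while rarely visited centroids stay near their UBM prior; this is the same relevance-MAP behavior as in GMM mean adaptation, only with hard instead of posterior-weighted assignments.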
We focus on the parameter setting for fast N-top scoring, model order, performance for different amounts of training data, and the effects of mismatched data. In [2], our main focus was on the formal derivation of the algorithm rather than on extensive testing. This paper serves the latter purpose. Since the VQ model has fewer free parameters to be estimated, it may be hypothesized that the VQ-based classifier will outperform the GMM for small amounts of data; see, for instance, [9] for such an observation. This hypothesis is probably true if both models are trained using maximum likelihood (mean square error minimization). However, it is less clear how the situation changes when using MAP training. In this paper, we will show surprising experimental evidence that suggests the opposite: GMM-UBM is better for short utterances, whereas VQ-UBM outperforms GMM-UBM when the length of the training and test data increases. We discuss the possible reasons for this and its implications.

2. SYSTEM DESCRIPTION

2.1. Feature Extraction and Classifier Training

The MFCCs are extracted from 30 ms Hamming-windowed frames with 50 % overlap. We use 12 MFCCs computed via a 27-channel mel-frequency filterbank. The MFCC trajectories are smoothed with RASTA filtering, followed by appending of the Δ and Δ² features. The last two steps are voice activity detection (VAD) and utterance-level mean and variance normalization, in that order. For the VAD, we use an energy-based algorithm with a file-dependent detection threshold based on the maximum energy level. The GMM-UBM system follows the standard implementation with diagonal covariance matrices [1]. We use two gender-dependent UBMs trained by the deterministic splitting method, followed by seven