TEXT-DEPENDENT GMM-JFA SYSTEM FOR PASSWORD BASED SPEAKER VERIFICATION

Sergey Novoselov 1, Timur Pekhovsky 1,2, Andrey Shulipa 1, Alexey Sholokhov 1,2
1 Speech Technology Center Ltd., St. Petersburg, Russia
2 National Research University of Information Technologies, Mechanics and Optics, Russia
{novoselov, tim, shulipa, sholohov}@speechpro.com

ABSTRACT

We propose a new State-GMM-supervector extractor for solving the problem of text-dependent speaker recognition. The proposed scheme for supervector extraction makes it easy to implement a text-dependent JFA system for passphrase verification. We examine the conditions of both a global and a text-prompted passphrase. The experiments conducted on the Wells Fargo Bank speech database show that the proposed method makes it possible to create more accurate statistical models of speech signals and to achieve a 44% relative reduction in EER compared to the best state-of-the-art text-dependent verification systems for a text-prompted passphrase.

Index Terms — speaker recognition, NAP, SVM, JFA, UBM, GMM, HMM, supervector

1. INTRODUCTION

As demonstrated by recent publications, the substantial success of state-of-the-art text-dependent verification systems is largely due to progress in text-independent speaker recognition. For example, [1-3] use such widely known paradigms as GMM-UBM (Gaussian Mixture Model-Universal Background Model), the GMM mean supervector and its MAP (Maximum A Posteriori) adaptation to the speaker model [4]. The idea of hybrid GMM/SVM (Support Vector Machine) systems [5] has also been efficiently adapted. These systems use WCCN (Within-Class Covariance Normalization), NAP (Nuisance Attribute Projection) or LDA (Linear Discriminant Analysis) projections of GMM mean supervectors to compensate for channel effects.
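To make the channel-compensation step concrete, the following is a minimal sketch of the NAP idea mentioned above: estimate the top within-speaker (nuisance) directions from labeled supervectors and project them out. This is a simplified illustration, not the configuration used in [1-3]; the function name and the toy dimensionality are our own.

```python
import numpy as np

def nap_projection(supervectors, labels, k=1):
    """Estimate a NAP projection removing the top-k nuisance
    (within-speaker) directions from GMM mean supervectors.
    Simplified sketch; real systems use many speakers/sessions."""
    X = np.asarray(supervectors, dtype=float)
    # Within-class scatter: deviations of each session from its speaker mean
    W = np.zeros((X.shape[1], X.shape[1]))
    for spk in set(labels):
        Xs = X[[i for i, l in enumerate(labels) if l == spk]]
        D = Xs - Xs.mean(axis=0)
        W += D.T @ D
    # Top-k eigenvectors of W span the nuisance subspace
    eigvals, eigvecs = np.linalg.eigh(W)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    # NAP projection P = I - U U^T, applied to a supervector x
    return lambda x: x - U @ (U.T @ x)
```

In a full system the same projection would be applied to all enrollment and test supervectors before SVM scoring; WCCN and LDA replace the hard removal of directions with a reweighting or a discriminative projection, respectively.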
The JFA (Joint Factor Analysis) method [6-8] is applied in [2] as an attempt to directly transfer a traditional text-independent JFA system, trained on large NIST SRE speaker databases, to a text-dependent task. The results in [3] lead us to conclude that if the task of text-dependent verification is performed under the conditions of a matched training set, for instance, the Wells Fargo Bank database, then the most successful of all the above-mentioned systems using traditional GMM mean supervectors is the Hidden Markov Model (HMM) NAP/SVM system. Moreover, as shown in [9, 10], under the conditions of a matched training set this system outperforms the currently most promising PLDA (Probabilistic Linear Discriminant Analysis) systems for text-dependent verification using i-vectors.

This work was supported by the Speech Technology Center, St. Petersburg, Russia.

In [1-3] the authors do not focus on the causes of the superiority of their HMM-NAP/SVM system over the other systems examined. However, we see two main causes. The first explains the advantage of the HMM-NAP/SVM system over the text-independent JFA system [2]: all experiments in [1-3] were conducted on the matched Wells Fargo Bank dataset, where the recording conditions of the test set closely match those of the training set [9]. Under these conditions a system with a small number of parameters trained on this training set can outperform a large text-independent JFA system, which has many parameters but whose parameters are largely uninformative here because they were trained on NIST SRE datasets. The second explains the advantage of the HMM-NAP/SVM system over the GMM-NAP/SVM system: when both systems were trained on the same Wells Fargo Bank training dataset, the HMM-NAP/SVM system was weakly overtrained to the matched conditions, since it has more parameters than the GMM-NAP/SVM system.
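For reference, the text-independent JFA model of [6-8] that [2] transfers to the text-dependent task decomposes a session-dependent GMM mean supervector as (standard formulation; symbols follow the usual JFA notation):

\[
M = m + Vy + Ux + Dz
\]

where \(m\) is the UBM mean supervector, \(V\) is the eigenvoice matrix with speaker factors \(y\), \(U\) is the eigenchannel matrix with channel factors \(x\), and \(D\) is a diagonal matrix with residual speaker factors \(z\). In the text-independent variant a single set of \(\{V, U, D\}\) is shared across all content, which is precisely what the text-dependent extension proposed below relaxes.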
In this paper we find potential for further improvement of systems that use GMM mean supervectors [1-3] and are trained on the Wells Fargo Bank datasets. We propose a new scheme for supervector extraction that makes it easy to implement the idea of a text-dependent JFA whose parameters depend on the states of the passphrase, in contrast to the text-independent variant [2]. We argue that such strengthening of the model is especially useful under the above-mentioned conditions of the Wells Fargo Bank dataset.

Section 2 provides a description of state-of-the-art text-dependent verification systems. Section 3 describes the proposed systems. Section 4 describes the system parameters that we use, as well as the Wells Fargo Bank databases. In Section 5 we present comparative experiments using the text-dependent protocol of Wells Fargo Bank and discuss the results. Section 6 concludes the paper.

2. BASELINE SYSTEMS

In this section we briefly describe the two best state-of-the-art systems [1], [3] for text-dependent speaker verification, which we will further refer to as baseline systems.

2.1. GMM-supervector

In this paper the baseline GMM system was implemented according to [1]. In this case the GMM mean supervector m of the passphrase was obtained using relevance MAP adaptation [4] of the speaker-independent UBM of this passphrase. Using the results of [3], we chose the best realization of the baseline GMM system and ML-trained the UBM of the passphrase on the development set of