COMPENSATION OF EXTRINSIC VARIABILITY IN SPEAKER VERIFICATION SYSTEMS ON SIMULATED SKYPE AND HF CHANNEL DATA

Korbinian Riedhammer, Tobias Bocklet, Elmar Nöth

Lehrstuhl für Mustererkennung, Universität Erlangen-Nürnberg
Martensstraße 3, 91058 Erlangen, GERMANY
korbinian.riedhammer@informatik.uni-erlangen.de

ABSTRACT

In this work we focus on speaker verification on channels of varying quality, namely Skype and high frequency (HF) radio. In our setup, we assume that telephone recordings of the speakers are available for training, while the test recordings come from different channels with varying (lower) signal quality. Starting from a Gaussian mixture / support vector machine (GMM/SVM) baseline, we evaluate multi-condition training (MCT), an ideal channel classification approach (ICC), and nuisance attribute projection (NAP) to compensate for the loss of information due to the transmission. In an evaluation on Switchboard-2 data using Skype and HF channel simulators, we show that, for good signal quality, NAP improves the baseline system performance from 5% equal error rate (EER) to 3.33% EER (for both Skype and HF). For strongly distorted data, MCT or, if adequate, ICC turn out to be the method of choice.

Index Terms: speaker verification, channel compensation

1. INTRODUCTION

The task of speaker verification describes the two-class problem of detecting speakers who pretend to be someone else, so-called impostors. In addition to the traditional scenario where speaker verification is applied to recordings from the telephone system or room microphones, other channels of communication are drawing increasing attention, e.g., Skype (http://www.skype.com) as a very popular (free) voice-over-IP service, or HF radio for long-range communication such as for military, nautical or aviation purposes.

The state of the art is to model a speaker by Gaussian mixture models (GMM) [1] that are estimated from features extracted from a spoken utterance, typically Mel frequency cepstral coefficients (MFCCs).
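To make the GMM modeling step concrete, the following minimal sketch shows how a single feature frame (e.g., one MFCC vector) could be scored under a diagonal-covariance GMM. The function name, the choice of diagonal covariances, and the toy parameters are assumptions made for illustration; they are not taken from the system described in this paper.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    frame:     list of feature values (hypothetical stand-in for one MFCC vector)
    weights:   mixture weights, summing to 1
    means:     one mean vector per mixture component
    variances: one vector of per-dimension variances per component
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a Gaussian with diagonal covariance.
        log_gauss = 0.0
        for x, m, v in zip(frame, mu, var):
            log_gauss += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over the components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Toy usage: a one-dimensional, single-component "GMM" (a standard normal).
single = gmm_log_likelihood([0.0], [1.0], [[0.0]], [[1.0]])
```

In practice the frame likelihoods would be accumulated over all frames of an utterance, and the model parameters would be estimated from data rather than set by hand.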
Our framework uses a universal background model (UBM) representing a set of background speakers. This UBM is then adapted to speaker-specific models using maximum a posteriori (MAP) adaptation [2]. The mean values of these models represent each target speaker in a high-dimensional space. In a next step, a support vector machine (SVM) is trained for each training speaker, with the UBM employed as the impostor model. The classification task is to determine whether a test speaker is closer to the background speakers or to the target speaker [3].

One major problem in this scenario is session variability, which comprises both extrinsic and intrinsic speaker variations [4]. This has been addressed by different techniques at different system levels. On the feature level, feature mapping (FM) [5] can be used to reduce the effect of different channels. On the model level, transformations like nuisance attribute projection (NAP) [6] or joint factor analysis (JFA) [7] can be applied. While FM and NAP do not handle the two kinds of variability differently, JFA tries to model extrinsic and intrinsic variations jointly.¹

In this work we keep the intrinsic variations constant and focus on a “controlled” variation of the extrinsic factors, i.e., recording channel or codec differences. This is achieved by applying various channel simulators to a set of given telephone data from the Switchboard-2 [8] corpus. As examples, we simulate high-frequency (HF) recordings at various quality levels as defined by the CCIR [9], and Skype codec compression at various quality settings using the Skype API. For the latter, the variations are in packet loss and in bit rate. Note that, for Skype, we focus on the actual simulated audio data and not on the encrypted stream as, for example, in [10].

In this work, we evaluate four types of systems, all based on the previously described GMM/SVM architecture.

1. As a baseline, we train a GMM/SVM system using the original telephone data and test it on both the original and the simulated data. This system is confronted with a strong acoustic mismatch between training and test conditions.

2. An ideal channel classification system, i.e., we train an individual system for each channel setting using the simulated training data and test on the respective simulated test data. This results in one system trained specifically on each channel configuration. This system is designed to have the least mismatch between training and test.

3. A general GMM/SVM system trained in a multi-condition manner: all simulated versions of each recording of the training set are employed to train multi-condition speaker models. The system is then tested on all test data.

4. A state-of-the-art intersession variability (ISV) compensation GMM/SVM system where a NAP transformation is estimated on various simulated quality settings of the training recordings. This results in a system with speaker models trained solely on the original (telephone) data but transformed into a “channel-free” space. The system is applied to all simulation conditions of the test data.

This article is structured as follows. After a brief introduction of the data and the channel simulation in Section 2, we describe the different speaker verification systems in Section 3. The results of the different systems are analyzed in Section 4. We conclude with a summary and an outlook in Section 5.

¹ As we use channel simulators to obtain several versions of the same recordings, thus eliminating the speaker variability, we chose the computationally simpler NAP for this work.

4840 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011
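The idea behind the NAP-based system (system 4) can be illustrated with a small sketch. It assumes that several channel versions of each training recording are available, takes the deviations of each version from its speaker's mean as the nuisance variation, and removes the single dominant nuisance direction (a rank-1 projection) found by power iteration. The three-dimensional toy "supervectors", the rank of 1, the function names, and the use of power iteration are all assumptions made for illustration; a real system operates on high-dimensional GMM mean supervectors and typically removes several directions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nuisance_direction(sessions_by_speaker, iterations=50):
    """Dominant within-speaker direction, estimated by power iteration."""
    dim = len(sessions_by_speaker[0][0])
    # Deviation of every session vector from its own speaker's mean.
    deviations = []
    for sessions in sessions_by_speaker:
        mean = [sum(s[i] for s in sessions) / len(sessions) for i in range(dim)]
        deviations.extend([[s[i] - mean[i] for i in range(dim)] for s in sessions])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Apply the scatter matrix S = sum_d d d^T without forming it.
        w = [0.0] * dim
        for d in deviations:
            c = dot(d, v)
            for i in range(dim):
                w[i] += c * d[i]
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # no within-speaker variation along the current vector
        v = [x / norm for x in w]
    return v


def nap_project(vector, direction):
    """Remove the component of `vector` along the unit-length `direction`."""
    c = dot(vector, direction)
    return [x - c * u for x, u in zip(vector, direction)]


# Toy data: two speakers, each seen through three simulated channels that
# shift the vector along one common (hypothetical) channel direction.
channel = [0.6, 0.8, 0.0]
speakers = [[1.0, 0.0, 2.0], [-1.0, 0.0, 1.0]]
offsets = [-1.0, 0.0, 1.0]
sessions_by_speaker = [
    [[b + a * c for b, c in zip(base, channel)] for a in offsets]
    for base in speakers
]

u = nuisance_direction(sessions_by_speaker)
projected = [[nap_project(s, u) for s in sessions] for sessions in sessions_by_speaker]
```

In this toy setting, all channel versions of the same speaker coincide after the projection, while the two speakers remain separated, which is the behavior the “channel-free” space is meant to approximate.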