COMPENSATION OF EXTRINSIC VARIABILITY IN SPEAKER VERIFICATION SYSTEMS
ON SIMULATED SKYPE AND HF CHANNEL DATA
Korbinian Riedhammer, Tobias Bocklet, Elmar Nöth
Lehrstuhl für Mustererkennung, Universität Erlangen-Nürnberg
Martensstraße 3, 91058 Erlangen, GERMANY
korbinian.riedhammer@informatik.uni-erlangen.de
ABSTRACT
In this work we focus on speaker verification on channels of varying
quality, namely Skype and high frequency (HF) radio. In our setup,
we assume that telephone recordings of the speakers are available for
training, whereas the test recordings come from different channels
with varying (lower) signal quality. Starting from a Gaussian mixture
model / support vector machine (GMM/SVM) baseline, we evaluate
multi-condition training
(MCT), an ideal channel classification approach (ICC), and nuisance
attribute projection (NAP) to compensate for the loss of informati-
on due to the transmission. In an evaluation on Switchboard-2 data
using Skype and HF channel simulators, we show that, for good si-
gnal quality, NAP improves the baseline system performance from
5% EER to 3.33% EER (for both Skype and HF). For strongly
distorted data, MCT or, if adequate, ICC turns out to be the method of
choice.
Index Terms— speaker verification, channel compensation
1. INTRODUCTION
The task of speaker verification describes the two-class problem of
detecting speakers who pretend to be someone else, so-called
impostors. In addition to the traditional scenario where speaker
verification is applied to recordings from the telephone system or room
microphones, other channels of communication are drawing more attention,
e.g., Skype (http://www.skype.com) as a very popular (free)
voice-over-IP service, or HF radio for long-range communication as
used for military, nautical or aviation purposes.
The state of the art is to model a speaker by Gaussian mixture
models (GMM) [1] that are estimated from features extracted
from a spoken utterance, typically Mel frequency cepstral
coefficients (MFCCs). Our framework uses a universal background model
(UBM) representing a set of background speakers. This UBM is
then adapted to speaker specific models using maximum a posteriori
(MAP) adaptation [2]. The mean values of these models represent
each target speaker in a high-dimensional space. In a next step, for
each training speaker, a support vector machine (SVM) is trained
where the UBM is employed as impostor model. The classification
task is to determine whether a test speaker is closer to the
background speakers or to the target speaker [3].
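The adaptation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes mean-only MAP adaptation of a diagonal-covariance UBM, and the function names and the relevance factor value are hypothetical choices made for this example.

```python
import numpy as np

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance GMM (the UBM).

    frames:    (T, D) feature vectors, e.g. MFCCs
    weights:   (K,)   UBM mixture weights
    means:     (K, D) UBM means
    variances: (K, D) UBM diagonal variances
    relevance: relevance factor (assumed value, not taken from the paper)
    """
    # Log-likelihood of every frame under every Gaussian: shape (T, K)
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff**2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss
    # Normalise to posteriors in a numerically stable way
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n_k = post.sum(axis=0)                          # soft counts, (K,)
    first = post.T @ frames                         # first-order stats, (K, D)
    e_k = first / np.maximum(n_k, 1e-10)[:, None]   # data means per component
    alpha = (n_k / (n_k + relevance))[:, None]      # adaptation coefficients
    # Interpolate between the data means and the UBM means
    return alpha * e_k + (1.0 - alpha) * means

def supervector(adapted_means):
    """Stack the adapted means into one high-dimensional vector."""
    return adapted_means.reshape(-1)
```

Components that see many frames move towards the speaker's data, while rarely visited components stay close to the UBM. The stacked means then serve as the input vector for the SVM.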
One major problem in this scenario is session variability, which
comprises both extrinsic and intrinsic speaker variations [4]. This has
been addressed by different techniques at different system levels.
At the feature level, feature mapping (FM) [5] can be used to reduce
the effect of different channels. At the model level, transformations like
nuisance attribute projection (NAP) [6] or joint factor analysis (JFA)
[7] can be applied. While FM and NAP do not handle the two kinds
of variability differently, JFA tries to model extrinsic and intrinsic
variations jointly.¹
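The idea behind NAP can be illustrated with a simplified sketch: estimate the directions along which several channel versions of the same recording differ, then project those directions out of every supervector. The code below is an assumption-laden simplification (plain SVD on group-centered supervectors, no kernel weighting as in the original formulation [6]); the function names and the `rank` parameter are hypothetical.

```python
import numpy as np

def estimate_nap(supervectors, recording_ids, rank):
    """Return an orthonormal basis U of shape (D, rank) for the nuisance subspace.

    supervectors:  (N, D) array, several channel versions per recording
    recording_ids: length-N labels; equal labels mark versions of one recording
    rank:          number of nuisance directions to remove (assumed parameter)
    """
    X = np.asarray(supervectors, dtype=float)
    ids = np.asarray(recording_ids)
    centered = np.empty_like(X)
    for r in np.unique(ids):
        mask = ids == r
        # Removing the group mean leaves only the channel-induced variation
        centered[mask] = X[mask] - X[mask].mean(axis=0)
    # Right singular vectors are the principal directions of that variation
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T

def apply_nap(x, U):
    """Apply the projection (I - U U^T) to a vector or to the rows of a matrix."""
    return x - (x @ U) @ U.T
```

In such a setup the same projection would be applied to the training and the test supervectors before SVM training and scoring.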
In this work we keep the intrinsic variations constant and focus
on a “controlled” variation of the extrinsic factors, i.e., recording
channel or codec differences. This is achieved by applying various
channel simulators to a set of given telephone data from the
Switchboard-2 [8] corpus. As examples, we simulate high-frequency
(HF) recordings at various quality levels as defined by the CCIR [9]
and Skype codec compression at various quality settings using the
Skype API. For the latter, the variations are in packet loss and in bit
rate. Note that, for Skype, we focus on the actual simulated audio
data and not the encrypted stream as, for example, in [10].
In this work, we evaluate four types of systems, all based on the
previously described GMM/SVM architecture.
1. As a baseline, we train a GMM/SVM system using the original
telephone data, and test it on both the original and simulated
data. This system is confronted with a strong acoustic
mismatch between training and test conditions.
2. An ideal channel classification system: we train an individual
system for each channel setting using the simulated
training data and test on the respective simulated test data.
This results in one system trained specifically on each channel
configuration. This setup is designed to have the least
mismatch between training and test.
3. A general GMM/SVM system trained in a multi-condition
manner: all simulated versions of each recording of the training
set are employed to train multi-condition speaker models. The
system is then tested on all test data.
4. A state-of-the-art intersession variability (ISV) compensation
GMM/SVM system where a NAP transformation is estima-
ted on various simulated quality settings of the training recor-
dings. This results in a system with speaker models trained
solely on the original (telephone) data but transformed into a
“channel-free” space. The system is applied to all simulation
conditions of the test data.
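The four setups differ mainly in which channel conditions feed training and testing. The following sketch writes that layout down as a data structure. The condition names are hypothetical placeholders, and two details are assumptions of this example rather than statements from the text: that "all test data" includes the original telephone condition, and that multi-condition training uses the simulated versions only.

```python
def build_plan(simulated, original="telephone"):
    """Map each system type to its (train, test) condition pairings.

    simulated: list of simulated channel condition names (hypothetical labels)
    original:  label for the unprocessed telephone data
    """
    all_conditions = [original] + list(simulated)
    return {
        # 1. Baseline: train on telephone data, test on every condition
        "baseline": [{"train": [original], "test": c} for c in all_conditions],
        # 2. ICC: one matched system per simulated channel setting
        "icc": [{"train": [c], "test": c} for c in simulated],
        # 3. MCT: one system trained on all simulated versions (assumption:
        #    tested on the original condition as well)
        "mct": [{"train": list(simulated), "test": c} for c in all_conditions],
        # 4. NAP: speaker models from telephone data, projection estimated
        #    on the simulated versions of the training recordings
        "nap": [{"train": [original], "nap_estimation": list(simulated), "test": c}
                for c in simulated],
    }
```

For example, `build_plan(["skype_low", "hf_poor"])` yields two matched ICC systems but only a single set of baseline speaker models that is tested on three conditions.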
This article is structured as follows. After a brief introduction
of the data and the channel simulation in Section 2, we describe the
different speaker verification systems in Section 3. The results of
the different systems are analyzed in Sec. 4. We conclude with a
summary and an outlook in Sec. 5.
¹ As we use channel simulators to obtain several versions of the same
recordings, thus eliminating the speaker variability, we chose the
computationally easier NAP for this work.
4840 978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011