LOQUENDO - POLITECNICO DI TORINO’S 2010 NIST
SPEAKER RECOGNITION EVALUATION SYSTEM
Fabio Castaldo*, Daniele Colibro*, Claudio Vair*, Sandro Cumani^ , Pietro Laface^,
Loquendo, Torino, Italy* Politecnico di Torino, Italy^
{first.lastname}@loquendo.com {first.lastname}@polito.it
ABSTRACT
This paper describes the improvements introduced in the
Loquendo–Politecnico di Torino (LPT) speaker recognition system
submitted to the NIST SRE10 evaluation campaign. This system
combines the results of eight core acoustic systems all based on
Gaussian Mixture Models (GMMs).
We illustrate the key factors, in the selection of the development
data and in the engineering of state-of-the-art technology, that
contributed to the very good performance and calibration of our
system in all the test conditions proposed in this evaluation.
Index Terms—Speaker Recognition, Speaker Segmentation,
Joint Factor Analysis, Total Variability models
1. INTRODUCTION
The 2010 Speaker Recognition Evaluation (SRE10) organized by
the National Institute of Standards and Technology (NIST),
focused, as usual, on the speaker detection task, where the goal is
to decide whether a target speaker is speaking in a segment of
conversational speech. System performance is assessed using the
Detection Cost Function (DCF) defined in the evaluation plan [1]
and by means of Detection Error Tradeoff (DET) curves [1].
The main difference of the 2010 evaluation with respect to the
previous ones is that the core test includes speech from telephone
conversations, conversations recorded over a room microphone
channel, and conversational speech from an interview scenario
recorded over a room microphone channel. Some of the telephone
conversations were collected so as to elicit particularly high or
particularly low speaker vocal effort.
Moreover, the evaluation of the systems was performed according
to a new Detection Cost Function that severely penalizes false
acceptance costs. SRE10 included 4 training and 3 testing
conditions, but only 9 different test configurations, with different
amounts of speech (10 sec, ∼5 minutes for the core condition, or
8 conversations) and 2-wire or 4-wire recordings. A detailed description of
the data, tasks and rules of SRE10 can be found in the evaluation
plan available in [1].
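The cost model behind this metric can be sketched in a few lines. The defaults below are illustrative, chosen only to mimic a very small target prior of the kind the new DCF uses; the official cost parameters are those given in the evaluation plan [1].

```python
def dcf(p_miss, p_fa, p_target=0.001, c_miss=1.0, c_fa=1.0):
    """Detection Cost Function: expected cost of miss and false-alarm
    errors at one operating point. A small target prior makes the
    false-acceptance term dominate the total cost."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

With a target prior this small, a 1% false-alarm rate contributes roughly a hundred times more cost than a 10% miss rate, which is why calibration in the low false-alarm region is critical.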
One of the most important factors for the success of our system
in this evaluation was the use of models obtained by Joint Factor
Analysis (JFA) [3] and by the Total Variability [4] approach,
which perform better than our Feature Domain Compensation
technique [5] at the expense of a higher computational cost. These
two technologies have been exploited to train eight systems,
differing only in the number and type of acoustic features chosen
to generate “complementary” systems: the scores of these systems
are combined and normalized to obtain the final scores.
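As a minimal sketch of such a combination, assuming a simple trial-aligned weighted sum (the function name and the linear form are illustrative; actual fusion weights would be trained on development data):

```python
def fuse_scores(system_scores, weights, offset=0.0):
    """Linear fusion: each trial's fused score is a weighted sum of
    the scores the subsystems assign to that trial, plus an offset."""
    assert len(system_scores) == len(weights)
    # zip(*system_scores) groups the per-system scores trial by trial
    return [offset + sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*system_scores)]
```
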
A wise usage of the development data was the second key
factor that allowed our fused systems to obtain a good calibration.
Only English speaker segments were selected; the development set
was extended so that the parameters that optimize the new DCF
could be reliably estimated; and, finally, we used only
the interview segments of the SRE08 development subset for
channel compensation, leaving the SRE08 training and test subsets
for back-end estimation and for evaluation. In other words, we
avoided partitioning the SRE08 train and test subsets to set aside
interview speaker segments for channel compensation.
Complying with the new DCF raised new issues in the
normalization and calibration process, which we addressed using
Adaptive T-norm [6] and custom development sets including many
impostors.
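T-norm normalizes each trial score by the statistics of the scores the same test segment obtains against a cohort of impostor models; the adaptive variant restricts the cohort to the impostors most similar to the target model. A minimal sketch, in which the similarity-based cohort selection of [6] is simplified to picking the top-scoring impostors:

```python
import statistics

def t_norm(raw_score, impostor_scores):
    """Standard T-norm: center and scale a trial score by the mean and
    standard deviation of the impostor-cohort scores."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return (raw_score - mu) / sigma

def adaptive_t_norm(raw_score, impostor_scores, n_closest):
    """Adaptive T-norm (simplified): keep only the n_closest impostors
    with the highest scores, as a proxy for similarity to the target."""
    cohort = sorted(impostor_scores, reverse=True)[:n_closest]
    return t_norm(raw_score, cohort)
```
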
We submitted results for all the test conditions, including the
summed channel test conditions, where the speaker segments were
obtained by means of the diarization technique presented in [7],
using English trained eigenvectors, rather than multilingual
eigenvectors. This simple replacement proved effective
compared with the best results reported in [8].
2. FEATURE EXTRACTION
Four sets of features were extracted for training the models
used in this evaluation, two "small" and two "large". All the
features are subject to short-term Gaussianization.
The first set (MFCC-25) is the "small" one that was used in the
SRE08 evaluation. It includes 12 Mel Frequency Cepstral
Coefficients (MFCC) plus 13 delta cepstral parameters (Δc0-Δc12)
computed every 10 ms. For this set of features, the analysis
bandwidth is 300-3400 Hz, and feature warping to a Gaussian
distribution is performed, for each static parameter stream, on a 3
sec sliding window, excluding silence frames.
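The warping step can be sketched as a rank-to-Gaussian-quantile mapping over a sliding window; the window length and the rank-based quantile estimate below are illustrative, and a real front end applies this per static parameter stream:

```python
from statistics import NormalDist

def feature_warp(stream, win=300):
    """Short-term Gaussianization: replace each value by the standard
    normal quantile matching its rank inside a sliding window
    (300 frames at 10 ms per frame approximates the 3 s window)."""
    ndist = NormalDist()
    half = win // 2
    warped = []
    for t, x in enumerate(stream):
        window = stream[max(0, t - half): t + half + 1]
        rank = sorted(window).index(x) + 1           # 1-based rank of x
        warped.append(ndist.inv_cdf((rank - 0.5) / len(window)))
    return warped
```

Because only ranks are used, the output distribution is approximately standard normal regardless of the scale or skew of the input stream.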
All the other feature sets are extracted analyzing the full 0-4000 Hz
bandwidth, and feature warping is performed before voice
activity detection is applied, thus including silence frames.
The second set of "small" features (PLP-26) includes 13 PLP
coefficients (c0-c12) and their first order derivatives.
The two sets of "large" features consist of 60 parameters each: 20 MFCC
coefficients (c0-c19) with their first and second order derivatives,
and 20 PLP parameters with their first and second order derivatives.
3. SPEAKER MODELS
For this evaluation we estimated models according to the Joint
Factor Analysis (JFA) and the Total Variability approaches, which
yield accurate models that account for inter-session
variability. Both approaches rely on GMMs estimated from a
Universal Background Model (UBM).
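In the Total Variability model a GMM supervector is written as s = m + T w, and the point estimate of the low-dimensional factor w follows from its posterior given the zero- and first-order Baum-Welch statistics. A minimal numpy sketch under the usual diagonal-covariance assumption (variable names and the expanded matrix shapes are illustrative):

```python
import numpy as np

def total_variability_factor(T, Sigma_inv, N, F):
    """Posterior mean of w for the model s = m + T w:
        w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
    T:         (CF x R) total variability matrix
    Sigma_inv: (CF x CF) inverse UBM covariance, supervector-expanded
    N:         (CF x CF) zero-order statistics, supervector-expanded
    F:         (CF,)     centered first-order statistics"""
    TtS = T.T @ Sigma_inv                       # T' Sigma^-1
    L = np.eye(T.shape[1]) + TtS @ (N @ T)      # posterior precision of w
    return np.linalg.solve(L, TtS @ F)          # posterior mean of w
```

The posterior precision L grows with the occupation counts in N, so segments with more speech yield more sharply determined factors.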
Gender-dependent UBMs were trained on telephone data only:
Switchboard II Phase 3, Switchboard Cellular Parts 1 and 2, and
978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011