LOQUENDO - POLITECNICO DI TORINO’S 2010 NIST
SPEAKER RECOGNITION EVALUATION SYSTEM
Fabio Castaldo*, Daniele Colibro*, Claudio Vair*, Sandro Cumani^ , Pietro Laface^,
Loquendo, Torino, Italy* Politecnico di Torino, Italy^
{first.lastname}@loquendo.com {first.lastname}@polito.it
ABSTRACT
This paper describes the improvements introduced in the
Loquendo–Politecnico di Torino (LPT) speaker recognition system
submitted to the NIST SRE10 evaluation campaign. This system
combines the results of eight core acoustic systems all based on
Gaussian Mixture Models (GMMs).
We illustrate the key factors, in the selection of the development
data and in the engineering of state-of-the-art technology, that
contributed to the very good performance and calibration of our
system in all the test conditions proposed in this evaluation.
Index Terms—Speaker Recognition, Speaker Segmentation,
Joint Factor Analysis, Total Variability models
1. INTRODUCTION
The 2010 Speaker Recognition Evaluation (SRE10) organized by
the National Institute of Standards and Technology (NIST),
focused, as usual, on the speaker detection task, where the goal is
to decide whether a target speaker is speaking in a segment of
conversational speech. System performance is assessed using the
Detection Cost Function (DCF) defined in the evaluation plan [1]
and by means of Detection Error Tradeoff (DET) curves [1].
The main difference of the 2010 evaluation with respect to the
previous ones is that the core test includes speech from telephone
conversations, conversations recorded over a room microphone
channel, and conversational speech from an interview scenario
recorded over a room microphone channel. Some of the telephone
conversations were collected so as to elicit particularly high or
particularly low speaker vocal effort.
Moreover, the evaluation of the systems was performed according
to a new Detection Cost Function that severely penalizes false
acceptance costs. SRE10 included 4 training and 3 testing
conditions, but only 9 different test configurations, with different
amounts of speech (10 sec, ∼5 minutes for the core condition, or
8 conversations) and 2-wire or 4-wire recordings. A detailed description of
the data, tasks and rules of SRE10 can be found in the evaluation
plan available in [1].
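The cost model behind this metric can be sketched in a few lines. The defaults below are illustrative, chosen only to mimic a very small target prior of the kind the new DCF uses; the official cost parameters are those given in the evaluation plan [1].

```python
def dcf(p_miss, p_fa, p_target=0.001, c_miss=1.0, c_fa=1.0):
    """Detection Cost Function: expected cost of miss and false-alarm
    errors at one operating point. A small target prior makes the
    false-acceptance term dominate the total cost."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

With a target prior this small, a 1% false-alarm rate contributes roughly a hundred times more cost than a 10% miss rate, which is why calibration in the low false-alarm region is critical.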
One of the most important factors for the success of our system
in this evaluation was the use of models obtained by Joint Factor
Analysis (JFA) [3] and by the Total Variability [4] approach,
which perform better than our Feature Domain Compensation
technique [5] at the expense of a higher computational cost. These
two technologies have been exploited to train eight systems,
differing only in the number and type of acoustic features chosen
to generate “complementary” systems: the scores of these systems
are combined and normalized to obtain the final scores.
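As a minimal sketch of such a combination, assuming a simple trial-aligned weighted sum (the function name and the linear form are illustrative; actual fusion weights would be trained on development data):

```python
def fuse_scores(system_scores, weights, offset=0.0):
    """Linear fusion: each trial's fused score is a weighted sum of
    the scores the subsystems assign to that trial, plus an offset."""
    assert len(system_scores) == len(weights)
    # zip(*system_scores) groups the per-system scores trial by trial
    return [offset + sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*system_scores)]
```
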
A wise usage of the development data was the second key
factor that allowed our fused systems to obtain a good calibration.
Only English speaker segments were selected; the development set
was extended so that the parameters that optimize the new DCF
could be reliably estimated; and, finally, we used only
the interview segments of the SRE08 development subset for
channel compensation, leaving the SRE08 training and test subsets
for back-end estimation and for evaluation. In other words, we
avoided partitioning the SRE08 train and test subsets to set aside
interview speaker segments for channel compensation.
Complying with the new DCF raised new issues in the
normalization and calibration process, which we addressed using
Adaptive T-norm [6] and custom development sets including many
impostors.
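T-norm normalizes each trial score by the statistics of the scores the same test segment obtains against a cohort of impostor models; the adaptive variant restricts the cohort to the impostors most similar to the target model. A minimal sketch, in which the similarity-based cohort selection of [6] is simplified to picking the top-scoring impostors:

```python
import statistics

def t_norm(raw_score, impostor_scores):
    """Standard T-norm: center and scale a trial score by the mean and
    standard deviation of the impostor-cohort scores."""
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    return (raw_score - mu) / sigma

def adaptive_t_norm(raw_score, impostor_scores, n_closest):
    """Adaptive T-norm (simplified): keep only the n_closest impostors
    with the highest scores, as a proxy for similarity to the target."""
    cohort = sorted(impostor_scores, reverse=True)[:n_closest]
    return t_norm(raw_score, cohort)
```
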
We submitted results for all the test conditions, including the
summed channel test conditions, where the speaker segments were
obtained by means of the diarization technique presented in [7],
using English trained eigenvectors, rather than multilingual
eigenvectors. This simple replacement proved effective
compared with the best results reported in [8].
2. FEATURE EXTRACTION
Four sets of features were extracted for training the models
used in this evaluation, two "small" and two "large". All the
features are subject to short-term Gaussianization.
The first set (MFCC-25) is the "small" one that was used in the
SRE08 evaluation. It includes 12 Mel Frequency Cepstral
Coefficients (MFCC) plus 13 delta cepstral parameters (Δc0-Δc12)
computed every 10 ms. For this set of features, the analysis
bandwidth is 300-3400 Hz, and feature warping to a Gaussian
distribution is performed, for each static parameter stream, on a 3
sec sliding window, excluding silence frames.
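The warping step can be sketched as a rank-to-Gaussian-quantile mapping over a sliding window; the window length and the rank-based quantile estimate below are illustrative, and a real front end applies this per static parameter stream:

```python
from statistics import NormalDist

def feature_warp(stream, win=300):
    """Short-term Gaussianization: replace each value by the standard
    normal quantile matching its rank inside a sliding window
    (300 frames at 10 ms per frame approximates the 3 s window)."""
    ndist = NormalDist()
    half = win // 2
    warped = []
    for t, x in enumerate(stream):
        window = stream[max(0, t - half): t + half + 1]
        rank = sorted(window).index(x) + 1           # 1-based rank of x
        warped.append(ndist.inv_cdf((rank - 0.5) / len(window)))
    return warped
```

Because only ranks are used, the output distribution is approximately standard normal regardless of the scale or skew of the input stream.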
All the other feature sets are extracted analyzing the full 0-4000 Hz
bandwidth, and feature warping is performed before voice
activity detection is applied, thus including silence frames.
The second set of "small" features (PLP-26) includes 13 PLP
coefficients (c0-c12) and their first order derivatives.
The two sets of "large" features consist of 60 parameters each: 20 MFCC
coefficients (c0-c19) with their first and second order derivatives,
and 20 PLP parameters with their first and second order derivatives.
3. SPEAKER MODELS
For this evaluation we estimated models according to the Joint
Factor Analysis (JFA) and the Total Variability approaches, which
yield accurate models that account for inter-session
variability. Both approaches rely on GMMs estimated from a
Universal Background Model (UBM).
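In the Total Variability model a GMM supervector is written as s = m + T w, and the point estimate of the low-dimensional factor w follows from its posterior given the zero- and first-order Baum-Welch statistics. A minimal numpy sketch under the usual diagonal-covariance assumption (variable names and the expanded matrix shapes are illustrative):

```python
import numpy as np

def total_variability_factor(T, Sigma_inv, N, F):
    """Posterior mean of w for the model s = m + T w:
        w = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F
    T:         (CF x R) total variability matrix
    Sigma_inv: (CF x CF) inverse UBM covariance, supervector-expanded
    N:         (CF x CF) zero-order statistics, supervector-expanded
    F:         (CF,)     centered first-order statistics"""
    TtS = T.T @ Sigma_inv                       # T' Sigma^-1
    L = np.eye(T.shape[1]) + TtS @ (N @ T)      # posterior precision of w
    return np.linalg.solve(L, TtS @ F)          # posterior mean of w
```

The posterior precision L grows with the occupation counts in N, so segments with more speech yield more sharply determined factors.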
Gender-dependent UBMs were trained on telephone data only:
Switchboard II Phase 3, Switchboard Cellular Parts 1 and 2, and
978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011