A Speaker Recognition System for the SITW Challenge

Oleg Kudashev 1,2, Sergey Novoselov 1,2, Konstantin Simonchik 1,2, Alexandr Kozlov 2
1 ITMO University, St. Petersburg, Russia
2 Speech Technology Center Ltd., St. Petersburg, Russia
{kudashev, novoselov, simonchik, kozlov-a}@speechpro.com

Abstract

This paper presents the ITMO University system submitted to the Speakers in the Wild (SITW) Speaker Recognition Challenge. During the evaluation track of the SITW challenge we explored conventional universal background model (UBM) Gaussian mixture model (GMM) i-vector systems and recently developed i-vector systems based on DNN posteriors. The systems were investigated under the real-world media channel conditions represented in the challenge. This paper discusses practical issues of training robust i-vector systems and investigates a denoising autoencoder (DAE) based back-end applied to “in the wild” conditions. Our speaker diarization approach for “multi-speaker in the file” conditions is also briefly presented. Experiments performed on the evaluation dataset demonstrate that DNN-based i-vector systems are superior to UBM-GMM based systems and that applying the DAE-based back-end improves system performance.

Index Terms: SITW, i-vector, DNN, PLDA, DAE.

1. Introduction

The Speakers in the Wild (SITW) Speaker Recognition Challenge [1, 2] deals with the task of speaker detection in unconstrained real-world conditions. The SITW Speaker Recognition Challenge provides a database [1] with speech recorded in such conditions. These recordings are samples of media channels that retain the natural characteristics of the original audio, such as noise, reverberation, compression and other artifacts.
Such varying conditions are expected to be difficult for speaker recognition, and the main goal of the challenge is to explore new ideas for solving major problems still faced by current speaker recognition technology and to apply them to real-world data.

Besides the “in the wild” recording conditions of the audio data, there are several other important aspects of the challenge. The SITW evaluation had:
- two tracks: evaluation and exploratory;
- three enrollment conditions: core, assist, assistclean;
- two test conditions: core and multi;
- a development set of approximately 120 speakers.
A detailed challenge description is presented in [2].

For many participants the small amount of ‘in-domain’ media channel development data makes it necessary to solve the domain mismatch problem in the challenge. The reason is that speaker recognition systems are typically trained on large datasets of microphone and telephone channels, such as the NIST SRE datasets, whose recording conditions differ greatly from those of the media channel data provided in the SITW challenge.

The application of the DNN-based i-vector extraction framework [3, 4, 5] to the speaker recognition task leads to significant performance improvements over conventional UBM-GMM based systems in telephone channel conditions. However, applying systems based on DNN posteriors under domain mismatch conditions (e.g. between microphone and telephone channels) comes with its own set of issues [4, 5]: the system overfits to the specific training conditions, which degrades its performance. The UBM-GMM based approach can thus be more convenient in the unconstrained conditions of media channels [4].

This work presents the development of different approaches based on UBM-GMM and DNN when applied to the challenge dataset. Significant attention is paid to practical issues of training robust i-vector systems.
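The two families of systems discussed above differ only in where the per-frame component posteriors come from (UBM-GMM Gaussians vs. DNN senone outputs); the Baum-Welch sufficient statistics fed to the i-vector extractor are computed identically. A minimal sketch of that shared step, illustrative only and not the authors' code (function and variable names are our own):

```python
import numpy as np

def baum_welch_stats(features, posteriors):
    """Zeroth- and first-order Baum-Welch statistics.

    features:   (T, D) array of acoustic frames (e.g. MFCCs).
    posteriors: (T, C) array of per-frame component posteriors;
                in a UBM-GMM system C is the number of Gaussians,
                in a DNN-based system C is the number of senones.
    """
    N = posteriors.sum(axis=0)   # (C,)   soft frame counts per component
    F = posteriors.T @ features  # (C, D) posterior-weighted feature sums
    return N, F

# Toy usage: 100 frames of 20-dim features, 8 components.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 20))
logits = rng.standard_normal((100, 8))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax
N, F = baum_welch_stats(feats, post)
print(N.shape, F.shape)
```

Because the posteriors of each frame sum to one, the zeroth-order statistics N always sum to the number of frames T, regardless of whether a GMM or a DNN produced them.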
The influence of using artificially noised training data to minimize the mismatch between training and evaluation conditions is studied. In addition to conventional PLDA, a novel back-end based on the DAE-PLDA scheme [6, 7] is investigated.

In order to solve the speaker recognition task in the “assist” and “assistclean” enrollment conditions, we propose an algorithm that applies a speaker diarization framework to extract speech segments of the target speaker based on a small amount of manually annotated material.

The final ITMO system for the evaluation track of the SITW challenge is a fusion of different subsystems with prior score stabilization with respect to the durations of the test and enrollment speech segments.

The paper is organized as follows. A detailed description of the ITMO speaker verification subsystems is given in Section 2. Section 3 describes the training dataset preparation. Section 4 presents our final experiments on the test dataset of the SITW Challenge. Section 5 concludes the paper.

2. System description

In this section we provide a description of all the speaker recognition subsystems used in our work. We reviewed a number of existing speaker identification frameworks in order to determine efficient and promising approaches to speaker identification in “in the wild” conditions.

2.1. UBM-GMM i-vector systems

The UBM/i-vector framework is well known in the speaker recognition field. During the SITW challenge we decided to explore two different UBM based i-vector extrac-

Copyright 2016 ISCA. INTERSPEECH 2016, September 8–12, 2016, San Francisco, USA. http://dx.doi.org/10.21437/Interspeech.2016-1197