Migrating i-vectors Between Speaker Recognition Systems Using Regression Neural Networks

Ondřej Glembek 1, Pavel Matějka 1,2, Oldřich Plchot 1, Jan Pešán 1, Lukáš Burget 1, and Petr Schwarz 1,2

1 Brno University of Technology, Speech@FIT group and IT4I Centre of Excellence, Czech Republic
2 Phonexia s.r.o., Brno, Czech Republic
{glembek,matejkap,iplchot,ipesan,burget,schwarzp}@fit.vutbr.cz

Abstract

This paper studies the scenario of migrating from one i-vector-based speaker recognition system (SRE) to another, i.e. comparing the i-vectors produced by one system with those produced by another system. System migration would typically be motivated by deploying a system with improved recognition accuracy, e.g. because of a technological upgrade, or because of the necessity of processing a new kind of data. Unfortunately, such migration is very likely to result in incompatibility between the new and the original i-vectors and, therefore, in the inability to compare the two. This work studies various topologies of Regression Neural Networks for transforming i-vectors from three different systems so that, with a slight loss in accuracy, they are compatible with the reference system. We present the results on the NIST SRE 2010 telephone condition.

Index Terms: speaker recognition, i-vector transformation, Regression Neural Networks, system migration

1. Introduction

Ever since their introduction in Speaker Recognition, i-vectors have been widely used in multiple fields of speech processing, such as Language Recognition [1], Age Estimation [2, 3], Emotion Detection [4], and even in Speech Recognition [5, 6]. The so-called i-vector is an information-rich, low-dimensional, fixed-length vector extracted from the feature sequence representing a speech segment (see Section 2 for details on i-vector extraction). Due to these properties, the i-vectors are often referred to as audio voice-prints.
Let us note that the term voice-print should be taken with care, as has been thoroughly discussed in [7] and [8]; it is only used in this work to denote a possible representation of an utterance. As such, the i-vectors can be used for audio indexing purposes, information exchange (e.g. between forensic or intelligence agencies), speaker search, etc. Such usage, however, assumes that the i-vector extraction method (including the parameters of the method) is kept fixed, so that all i-vectors are compatible and their direct comparison is feasible.

I-vector extraction is a complex process which depends on many sub-tasks, each of which is subject to continuous research aiming at increasing recognition performance. It is very likely that with every such improvement or change, the i-vector interpretation changes, making it impossible to perform any direct i-vector comparison. Using a deployed i-vector extraction system (let us refer to it as the reference system) for comparing and scoring i-vectors from an alternative, or alien, system would therefore require re-extracting the i-vectors for every utterance from the source audio.

Let us study an example scenario of a company having a database of i-vectors. For legal, capacity, or other reasons, the company cannot store the corresponding audio files. At a certain point, the company decides to upgrade its i-vector extraction to a newer system (now the "reference") but would still like to be able to use its existing database of i-vectors (now the "alien-system" generated i-vectors). Another example could be the need for inter-agency "voice-print" exchange; if two agencies use different i-vector extraction methods and want to exchange their i-vectors, there has to be a technique for mapping the alien i-vectors to the reference i-vectors.
In this work, we present a technique for computing the migration transformation of the alien i-vectors to the reference i-vectors, provided that there is a training set of i-vectors generated by both the reference and the alien systems. We study several topologies of artificial Regression Neural Networks (NN): with one and two hidden layers, as well as with no hidden layer, which downgrades the network to mere linear regression. These networks transform the i-vectors produced by an alien system to be compatible with the reference system.

2. Theoretical Background

Let us first take a look at the anatomy of our system. We will then describe the techniques used to transform the i-vectors to fit the reference system.

2.1. Feature extraction

In our systems, we used two different core feature extraction techniques: the MFCCs and the Perseus features [9], both described below. Both techniques produce a 20-dimensional feature vector calculated every 10 ms. This 20-dimensional feature vector was subjected to short-time mean and variance normalization using a 3 s sliding window. Delta and double-delta coefficients were then calculated using a five-frame window, giving a 60-dimensional feature vector.

Speech/silence segmentation was performed by the BUT Czech phoneme recognizer [10], where all phoneme classes are linked to the speech class. The recognizer was trained on the Czech CTS data, but we added noise with varying SNR to 30% of the database.

2.1.1. MFCC

In our experiments, we used cepstral features extracted with a 25 ms Hamming window. We used 24 Mel filter banks and limited the bandwidth to the 125–3800 Hz range. 19 Mel-frequency cepstral coefficients, together with the zeroth coefficient, were calculated every 10 ms.

Copyright © 2015 ISCA. INTERSPEECH 2015, September 6–10, 2015, Dresden, Germany. doi:10.21437/Interspeech.2015-504
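The short-time normalization and delta computation described in Section 2.1 can be sketched as follows. The window sizes follow the text (a 3 s sliding window, i.e. roughly 300 frames at a 10 ms shift, and a five-frame delta window); the exact delta regression formula is not specified in the paper, so a standard two-frame regression is assumed, and the input features here are random placeholders rather than real MFCCs:

```python
import numpy as np

def stmvn(feats, win=300):
    """Short-time mean/variance normalization over a sliding window.
    win=300 frames corresponds to ~3 s at a 10 ms frame shift."""
    out = np.empty_like(feats)
    half = win // 2
    for t in range(len(feats)):
        seg = feats[max(0, t - half): t + half]      # window shrinks at edges
        out[t] = (feats[t] - seg.mean(0)) / (seg.std(0) + 1e-8)
    return out

def deltas(feats, N=2):
    """Regression deltas over a 2N+1 = five-frame window."""
    n = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    num = sum(k * (padded[N + k:n + N + k] - padded[N - k:n + N - k])
              for k in range(1, N + 1))
    return num / (2 * sum(k * k for k in range(1, N + 1)))

# 20-dimensional base features -> 60 dimensions with deltas and double deltas
x = np.random.default_rng(2).normal(size=(500, 20))  # placeholder features
x = stmvn(x)
d1 = deltas(x)        # delta coefficients
d2 = deltas(d1)       # double-delta coefficients
feats60 = np.hstack([x, d1, d2])
```

Appending the deltas and double deltas to the 20 normalized base coefficients yields the 60-dimensional vectors used by the i-vector extractors.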