DISCRIMINATIVE AUTOENCODERS FOR SPEAKER VERIFICATION

Hung-Shin Lee 1,2  Yu-Ding Lu 3  Chin-Cheng Hsu 2  Yu Tsao 3  Hsin-Min Wang 2  Shyh-Kang Jeng 1

1 Department of Electrical Engineering, National Taiwan University, Taiwan
2 Institute of Information Science, Academia Sinica, Taiwan
3 Research Center for Information Technology Innovation, Academia Sinica, Taiwan

ABSTRACT

This paper presents a learning and scoring framework based on neural networks for speaker verification. The framework employs an autoencoder as its primary structure, while three factors are jointly considered in the objective function for speaker discrimination. The first, relating to the sample reconstruction error, makes the structure essentially a generative model, which helps it learn the most salient and useful properties of the data. Operating on the middlemost hidden layer, the other two attempt to ensure that utterances spoken by the same speaker are mapped to similar identity codes in the speaker-discriminative subspace, where the dispersion of all identity codes is also maximized to some extent so as to avoid over-concentration. Finally, the decision score of each utterance pair is simply computed as the cosine similarity of their identity codes. Operating on utterances represented by i-vectors, the results of experiments conducted on the male portion of the core task in the NIST 2010 Speaker Recognition Evaluation (SRE) clearly demonstrate the merits of our approach over the conventional PLDA method.
Index Terms— autoencoders, speaker verification, discriminative training, neural networks, PLDA

1 Introduction

Even though deep neural networks (DNNs) have become increasingly popular in the field of speaker recognition and have yielded some gains in performance [1, 2, 3, 4, 5, 6], total variability modeling (i-vectors) [7] and probabilistic linear discriminant analysis (PLDA) [8], as well as their modifications, remain indispensable and robust ingredients of most current speaker verification systems. The aim of the i-vector approach is to represent variable-length speech signals by fixed-size vectorial tokens in which the session/channel variabilities induced by various sources are compensated and the speaker characteristics are preserved as much as possible [9]. Given two i-vectors, the task of PLDA is to linearly discriminate between speakers in a low-rank subspace and to provide a reasonable metric for their decision score in a probabilistic sense [10, 11].

In fact, going through the latest three ICASSP proceedings (2014-16), more than three-fourths of the papers dealing with speaker verification use PLDA as one of their scoring backends; far fewer efforts either develop their own competitive algorithms for backend speaker discrimination or improve on PLDA by standing on its shoulders. For example, Rohdin et al. gave a discriminative PLDA training algorithm in which constraints are imposed on the derivation of the speaker variability matrix [12]. Similar to the work by Lee et al. [13], Cumani and Laface employed pairwise support vector machines (SVMs) to efficiently classify pairs of i-vectors as belonging or not to the same speaker, even on large-scale datasets [14]. Nautsch et al. proposed a PLDA-like approach based on restricted Boltzmann machines (RBMs), which aims at suppressing channel effects and recovering speaker-discriminative information on a small dataset [15].
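As a point of contrast with these probabilistic backends, the scoring and training objectives summarized in the abstract can be sketched in a few lines. The following NumPy illustration is only a conceptual sketch, not the paper's actual formulation: the function names, the squared-error forms of the pull and dispersion terms, and the weights alpha and beta are all our illustrative assumptions.

```python
import numpy as np

def cosine_score(code_a, code_b):
    """Trial decision score: cosine similarity of the two i-codes."""
    return float(np.dot(code_a, code_b) /
                 (np.linalg.norm(code_a) * np.linalg.norm(code_b)))

def discriminative_ae_loss(x, x_hat, codes, labels, alpha=1.0, beta=0.1):
    """Illustrative combination of the three objectives: the sample
    reconstruction error, a pull term drawing same-speaker i-codes
    together, and a dispersion term (subtracted, so it is maximized)
    spreading all i-codes to avoid over-concentration."""
    recon = np.mean((x - x_hat) ** 2)                 # reconstruction error
    pull = 0.0
    for spk in np.unique(labels):
        c = codes[labels == spk]
        pull += np.sum((c - c.mean(axis=0)) ** 2)     # within-speaker scatter
    pull /= len(codes)
    dispersion = np.mean(np.sum((codes - codes.mean(axis=0)) ** 2, axis=1))
    return recon + alpha * pull - beta * dispersion
```

At test time only `cosine_score` is needed: each utterance's i-vector is encoded to its i-code and a trial is scored directly, with no probabilistic model fitted on top.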
Most recently, Heigold et al. used DNNs and long short-term memory (LSTM) networks to represent utterances and directly map each trial set of utterances to a decision score for verification [16].

In this paper, we replace the role of PLDA with an autoencoder and tweak its objectives for speaker discrimination. The autoencoder is a symmetric neural network that is trained to approximately copy its input to its output [17]. Besides the reconstruction error, which makes the autoencoder analogous to a generative model and helps it learn the most salient and useful properties of the data in an unsupervised manner, two more objective functions are involved in our proposed framework. They attempt to ensure that utterances spoken by the same speaker have similar identity codes (i-codes) in the speaker-discriminative subspace represented by the middlemost hidden layer, where the scatter of all i-codes is also maximized to some extent to avoid over-concentration. Finally, the decision score of each utterance pair is simply computed as the cosine similarity of their i-codes. To the best of our knowledge, although the autoencoder has been widely applied to many speech processing tasks, such as speech enhancement [18, 19], acoustic novelty detection [20], and reverberant speech recognition [21], far fewer papers have used it directly for speaker recognition.

Most importantly, our contributions are two-fold. First, we present a kind of neural network-based discriminant analysis, which substantially and nonlinearly extends the capability of PLDA. Second, our proposed model is immune to