A CHANNEL-BLIND SYSTEM FOR SPEAKER VERIFICATION

Najim Dehak 1, Zahi N. Karam 2,3, Douglas A. Reynolds 3, Réda Dehak 4, William M. Campbell 3, James R. Glass 1

1 CSAIL at MIT, Cambridge, MA, USA; 2 DSPG, RLE at MIT, Cambridge, MA, USA; 3 MIT Lincoln Laboratory, Lexington, MA, USA; 4 LRDE, Paris, France

ABSTRACT

The majority of speaker verification systems proposed in the NIST speaker recognition evaluation are conditioned on the type of data to be processed: telephone or microphone. In this paper, we propose a new speaker verification system that can be applied to both types of data. This system, named the blind system, is based on an extension of the total variability framework. Recognition results with the proposed channel-independent system are comparable to state-of-the-art systems that require conditioning on the channel type. Another advantage of our proposed system is that it allows for combining data from multiple channels in the same visualization in order to explore the effects of different microphones and collection environments.

Index Terms— Total variability space, PLDA, LDA, WCCN

1. INTRODUCTION

Over the last five years, several channel compensation approaches have been proposed for speaker verification. However, Joint Factor Analysis (JFA) [1] became one of the most popular. This technique was proposed in the context of the Gaussian Mixture Model (GMM) framework in order to model between-speaker variability and to compensate for channel effects. The basic assumption of the JFA approach is that a high-dimensional GMM supervector for a given utterance can be decomposed into the sum of two parts: the first part depends on the speaker and contains the useful information, and the second depends on the channel and models the information that we need to compensate for.
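The JFA decomposition described above can be illustrated with a minimal NumPy sketch. The dimensions below are toy values chosen only for illustration (real systems use supervectors of tens of thousands of dimensions), and the subspace names V (speaker) and U (channel) follow standard JFA notation from [1] rather than this excerpt:

```python
import numpy as np

# Toy dimensions, illustrative only: real supervectors are far larger.
sv_dim = 12          # GMM supervector dimension
n_speaker_f = 3      # rank of the speaker subspace
n_channel_f = 2      # rank of the channel subspace

rng = np.random.default_rng(0)
m = rng.normal(size=sv_dim)                  # UBM mean supervector
V = rng.normal(size=(sv_dim, n_speaker_f))   # speaker subspace (eigenvoices)
U = rng.normal(size=(sv_dim, n_channel_f))   # channel subspace (eigenchannels)
y = rng.standard_normal(n_speaker_f)         # speaker factors
x = rng.standard_normal(n_channel_f)         # channel factors

speaker_part = m + V @ y   # useful speaker information
channel_part = U @ x       # nuisance variability to be compensated
M = speaker_part + channel_part
```

The key point is the explicit split: the speaker part carries the information to keep, while the channel part is modeled only so it can be removed.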
Recently, in [2], we proposed a new speaker verification system that uses factor analysis techniques for feature extraction rather than for separate speaker and channel modeling, as is done in JFA [1]. In this new approach, every speech recording is mapped to a single low-dimensional total variability vector, named the total factors. Unlike JFA, there is no distinction between the speaker and intersession variabilities in the GMM supervector space. Channel compensation in the new approach is carried out in the low-dimensional total variability space instead of the GMM supervector space, and consists of a combination of Linear Discriminant Analysis (LDA) and Within-Class Covariance Normalization (WCCN) [2]. The speaker verification decision score is the cosine similarity computed between the target and test total factors. The total variability space was first applied to the telephone data of the NIST speaker recognition evaluation; an extension of this approach to microphone data was later proposed in [3]. This approach consists of stacking extra total factors, estimated on the microphone data, onto the original telephone total factors. An extension of the LDA and WCCN combination was also proposed to handle the interview condition.

In the context of NIST Speaker Recognition Evaluations (SRE) [4], all proposed speaker verification systems are conditioned on the type of data (telephone or interview) to be used. In this paper we propose a new single total variability system that can be applied simultaneously to both telephone and microphone data, without prior knowledge of the data type being processed.

(This work was sponsored by the Department of Defense under Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.)
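The compensation-then-scoring chain described above (LDA projection, WCCN, then cosine similarity) can be sketched as follows. This is a minimal illustration under assumed conventions, not the authors' implementation: the LDA matrix A is taken as given, and WCCN is computed as the Cholesky factor of the inverse within-speaker covariance, which is the usual formulation:

```python
import numpy as np

def wccn_projection(vectors, labels):
    """Estimate the WCCN projection: Cholesky factor B of the inverse
    within-speaker covariance, so that B @ B.T = W^{-1}."""
    classes = np.unique(labels)
    d = vectors.shape[1]
    W = np.zeros((d, d))
    for c in classes:
        X = vectors[labels == c]
        Xc = X - X.mean(axis=0)       # center within each speaker
        W += Xc.T @ Xc / len(X)
    W /= len(classes)                  # average within-class covariance
    return np.linalg.cholesky(np.linalg.inv(W))

def cosine_score(w_target, w_test, A, B):
    """Decision score: cosine similarity after LDA (A) then WCCN (B)."""
    t = B.T @ (A.T @ w_target)
    s = B.T @ (A.T @ w_test)
    return float(t @ s / (np.linalg.norm(t) * np.linalg.norm(s)))
```

Note that the score depends only on the angle between the projected vectors, so any overall scaling introduced by the projections has no effect on the decision.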
This new system is also based on stacking total variability spaces for both telephone and interview data, as proposed in [3]. However, we will show how Probabilistic Linear Discriminant Analysis (PLDA) [5] can be used to project both telephone and interview total factors into a common space. In this common space, an LDA and WCCN combination is then applied to further compensate for remaining channel effects.

A data visualization technique, first proposed in the context of speaker verification in [6], is used as both an exploratory and analysis tool. The technique uses graph embedding, graph layout, and visualization software [7] to visualize all speech utterances within a data-set of interest, in a manner that groups together utterances that are similar according to the blind-system score. With this tool we are able to highlight the efficacy of the blind system, as well as the crucial role of WCCN/LDA in removing channel variability.

2. TOTAL VARIABILITY SPACE

The total variability space proposed in [2] models both the speaker and channel variabilities simultaneously. It is defined by the total variability matrix, which contains the eigenvectors with the largest eigenvalues of the total variability covariance matrix. In this model we make no distinction between the speaker and the channel effects in the GMM supervector space, in contrast to JFA [1], which does. For a given speech utterance, the speaker- and channel-dependent GMM supervector is represented by the following equation:

M = m + T_tel w    (1)

where m is the Universal Background Model (UBM) supervector, the low-rank matrix T_tel defines the total variability space estimated on telephone speech, and the vector w contains the speaker- and session-dependent factors in the total variability space. The w vectors are random variables distributed according to the standard normal distribution N(0, I).
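Eq. (1) can be made concrete with a small numerical sketch. The dimensions here are toy values for illustration only (a real T_tel has thousands of rows and a few hundred columns); the point is that a single low-rank matrix generates the supervector offset, with no separate speaker and channel terms:

```python
import numpy as np

# Toy dimensions, illustrative only.
sv_dim, tv_dim = 12, 4

rng = np.random.default_rng(7)
m = rng.normal(size=sv_dim)                 # UBM mean supervector
T_tel = rng.normal(size=(sv_dim, tv_dim))   # low-rank total variability matrix
w = rng.standard_normal(tv_dim)             # total factors, w ~ N(0, I)

# Eq. (1): speaker- and channel-dependent supervector
M = m + T_tel @ w
```

Unlike the JFA decomposition, the single vector w here mixes speaker and session information; separating them is deferred to the later compensation steps in the low-dimensional space.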
The success of this new approach on the telephone data of the NIST-SRE is mainly due to the large amount of telephone data used to train the total variability matrix T_tel. An extension of the total variability space to the interview data of the NIST evaluation is proposed in [3]. It is based on estimating extra total variability

978-1-4577-0539-7/11/$26.00 ©2011 IEEE ICASSP 2011