A CHANNEL-BLIND SYSTEM FOR SPEAKER VERIFICATION
Najim Dehak (1), Zahi N. Karam (2,3), Douglas A. Reynolds (3),
Réda Dehak (4), William M. Campbell (3), James R. Glass (1)

(1) CSAIL at MIT, Cambridge, MA, USA; (2) DSPG, RLE at MIT, Cambridge,
MA, USA; (3) MIT Lincoln Laboratory, Lexington, MA, USA; (4) LRDE,
Paris, France
ABSTRACT
The majority of speaker verification systems submitted to the NIST
speaker recognition evaluations are conditioned on the type of data to
be processed: telephone or microphone. In this paper, we propose
a new speaker verification system that can be applied to both types
of data. This system, which we call the blind system, is based on an extension
of the total variability framework. Recognition results with the pro-
posed channel-independent system are comparable to state of the art
systems that require conditioning on the channel type. Another ad-
vantage of our proposed system is that it allows for combining data
from multiple channels in the same visualization in order to explore
the effects of different microphones and collection environments.
Index Terms— Total variability space, PLDA, LDA, WCCN.
1. INTRODUCTION
Over the last five years, several channel compensation approaches
have been proposed for speaker verification. However, Joint Factor
Analysis (JFA) [1] has become one of the most popular. This technique
was proposed in the context of the Gaussian Mixture Model
(GMM) framework in order to model between-speaker variability
and to compensate for channel effects. The basic assumption of the
JFA approach is that a high dimensional GMM supervector for a
given utterance can be decomposed into the sum of two parts: the
first depends on the speaker and contains the useful information,
while the second depends on the channel and models the nuisance
variability that we need to compensate for.
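In the standard JFA notation of [1], this two-part decomposition is usually written as follows (a reference sketch; the exact configuration of the factors varies across systems):

```latex
M \;=\; \underbrace{m + Vy + Dz}_{\text{speaker part}} \;+\; \underbrace{Ux}_{\text{channel part}}
```

where m is the UBM mean supervector, V (eigenvoices) and the diagonal residual matrix D model speaker variability, and U (eigenchannels) models channel variability.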
Recently, in [2], we proposed a new speaker verification system
that uses factor analysis techniques for feature extraction rather than
separate speaker and channel modeling, as is done in JFA [1]. In this
new approach, every speech recording is mapped into a single low-
dimensional total variability vector named total factors. Unlike JFA,
there is no distinction between the speaker and intersession variabil-
ities in the GMM supervector space. The channel compensation in
the new approach is carried out in the low-dimensional total variabil-
ity space instead of the GMM supervector space. It consists of
a combination of Linear Discriminant Analysis (LDA) and
Within-Class Covariance Normalization (WCCN) [2]. The speaker
verification decision score is obtained using the cosine similarity computed
between the target and test total factors. The total variability space
was first applied in the context of telephone data of the NIST speaker
recognition evaluation. However, an extension of this approach was
also proposed for microphone data [3]. This approach consists of
stacking extra total factors, estimated on the microphone data, onto
the original telephone total factors. An extension of the LDA and
WCCN combination was also proposed to handle the interview condition.

(This work was sponsored by the Department of Defense under Air Force
contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and
recommendations are those of the authors and are not necessarily
endorsed by the United States Government.)
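The cosine scoring step used for the verification decision admits a very short sketch. The following is an illustrative implementation (the function name and the toy vectors are ours, not the paper's):

```python
import math

def cosine_score(w_target, w_test):
    """Verification score: cosine of the angle between a target and a
    test total-factor vector (after any channel compensation)."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)

# Toy 3-dimensional total factors; real systems use a few hundred dims.
print(cosine_score([1.0, 0.5, -0.2], [0.9, 0.6, -0.1]))  # near 1: accept
```

Because the score is just an angle between two vectors, no enrollment-time model training is needed beyond extracting the total factors themselves.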
In the context of NIST Speaker Recognition Evaluations (SRE)
[4], all proposed speaker verification systems are conditioned on the
type of data (telephone or interview) to be used. In this paper we
propose a single new total variability system that can be applied to
both telephone and microphone data without prior
knowledge about the data type being processed. This new system is
also based on stacking the telephone and interview total variability
spaces, as proposed in [3]. However, we will show how
Probabilistic Linear Discriminant Analysis (PLDA) [5] can be used
to project both telephone and interview total factors into a common
space. In this common space, an LDA and WCCN combination is then
applied to further compensate for residual channel effects.
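Of the two compensation steps, WCCN is the simpler to sketch. The following illustrative implementation (function name and array layout are our assumptions; LDA, which would precede it, is omitted) estimates the projection from labeled training total factors:

```python
import numpy as np

def wccn_projection(vectors_by_speaker):
    """Estimate a WCCN projection from labeled training total factors.

    vectors_by_speaker: list of (n_i, d) arrays, one array per speaker.
    Returns B, the Cholesky factor of the inverse within-class
    covariance; compensated vectors are obtained as B.T @ w.
    """
    d = vectors_by_speaker[0].shape[1]
    W = np.zeros((d, d))
    for X in vectors_by_speaker:
        Xc = X - X.mean(axis=0)        # center within each speaker class
        W += Xc.T @ Xc / len(X)
    W /= len(vectors_by_speaker)       # average over speaker classes
    return np.linalg.cholesky(np.linalg.inv(W))
```

The projection whitens the within-speaker scatter, so that nuisance directions contribute less to the subsequent cosine scoring.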
A data visualization technique, first proposed in the context of
speaker verification in [6], is used as both an exploratory and anal-
ysis tool. The technique uses graph embedding, graph layout and
visualization software [7] to visualize all speech utterances within a
data-set of interest. The layout places utterances that the blind
system scores as similar close to one another. With this
tool we are able to highlight the efficacy of the blind system, as well
as the crucial role of WCCN/LDA in removing channel variability.
2. TOTAL VARIABILITY SPACE
The total variability space proposed in [2] models both the speaker
and channel variabilities simultaneously. It is defined by the total
variability matrix, which contains the eigenvectors with the largest
eigenvalues of the total variability covariance matrix. In this new
model, we make no distinction between the speaker and the channel
effects in the GMM supervector space, unlike JFA [1], which models
them separately. For a given speech utterance, the speaker- and
channel-dependent GMM supervector is represented by the following
equation:
M = m + T_tel w    (1)
where m is the Universal Background Model (UBM) supervector,
the low-rank matrix T_tel defines the total variability space estimated
on telephone speech, and the vector w is the speaker- and session-
dependent factors in the total variability space. The w vectors are
random variables distributed according to the Normal distribution
N(0, I).
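Equation (1) can be illustrated with a short generative sketch. The dimensions below are illustrative stand-ins, and in a real system w is not sampled but estimated from the utterance's sufficient statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
sv_dim, tv_rank = 1000, 50   # illustrative sizes only

m = rng.standard_normal(sv_dim)                 # UBM supervector
T_tel = rng.standard_normal((sv_dim, tv_rank))  # total variability matrix
w = rng.standard_normal(tv_rank)                # total factors, w ~ N(0, I)

M = m + T_tel @ w   # Eq. (1): speaker- and channel-dependent supervector
```

The key point is the rank reduction: every utterance, whatever its channel, is summarized by the low-dimensional vector w rather than the full supervector M.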
The large success of this new approach on the telephone data of
the NIST-SRE is mainly due to the large amount of telephone data
used to train the total variability matrix T_tel. An extension of the
total variability space to the interview data of the NIST evaluation
is proposed in [3]. It is based on estimating extra total variability
978-1-4577-0539-7/11/$26.00 ©2011 IEEE    ICASSP 2011