On Cohort Selection for Speaker Verification

Yaniv Zigel and Arnon Cohen
Electrical and Computer Engineering Department, Ben-Gurion University, Beer-Sheva, Israel
yaniv(arnon)@ee.bgu.ac.il

Abstract

Speaker verification systems require some kind of background model to reliably perform the verification task. Several algorithms have been proposed for the selection of cohort models to form a background model. This paper proposes a new cohort selection method called Close Impostor Clustering (CIC). The new method is shown to outperform several other methods in a text-dependent verification task. Several normalization methods are also compared. With three cohort models and the best score-normalization method, the CIC yielded an average Equal Error Rate (EER) of 0.8%, while the second best method (Maximally-Spread Close, MSC) yielded an average EER of 1.1%.

1. Introduction

The goal of speaker verification systems is to determine whether or not a given utterance was produced by the claimed speaker. This is done by comparing a score, which reflects the match between the given utterance and the claimed speaker’s model, with a threshold. In verification systems based on stochastic models (such as HMM and GMM), the simplest score is the likelihood of the utterance given the claimed speaker’s model. This score is very sensitive to variations in text, speaking behavior, and recording conditions, especially for utterances from impostors (non-target speakers), in both text-independent and text-dependent tasks. This sensitivity causes wide variations in scores and makes threshold determination very difficult.

To overcome this sensitivity, the use of a normalized score based on cohort speakers (impostors) has been proposed [1-6]. Several issues arise with the use of cohorts, among them the selection of the impostors’ models (the cohort set), the number of impostors in the cohort set, and the score normalization technique (the normalization function).
In this paper, a new method for cohort selection based on speaker clustering is introduced. This selection method is compared with other reported methods. The problem of the order of the cohort set, namely the number of impostors, is also examined. The problem of verification-score normalization is discussed, and results of several score normalization methods are presented. For these experiments, a text-dependent speaker verification system based on Hidden Markov Models (HMMs) has been implemented.

Supported in part by the EC project: MOUMIR

2. Score Normalization using Cohort Models

In verification systems, the decision to accept or reject an identity claim, T, is based on the comparison of a score, $s(O)$, with a threshold, $\theta$:

$$ s(O) \;\underset{\text{reject}}{\overset{\text{accept}}{\gtrless}}\; \theta \tag{1} $$

The simplest score for verification systems based on stochastic models is the log likelihood, which is the log probability of the observations (utterances), $O$, given the model of the target (claimed speaker), $\lambda_T$:

$$ s(O) = \log p(O \mid \lambda_T). \tag{2} $$

As mentioned in the previous section, normalized scores are preferred over the un-normalized score of (2). The most obvious normalization term is probably the background model likelihood:

$$ s_n(O) = \log p(O \mid \lambda_T) - \log p(O \mid \lambda_B) = \log \frac{p(O \mid \lambda_T)}{p(O \mid \lambda_B)} \tag{3} $$

$p(O \mid \lambda_B)$, known as the normalization term, is the likelihood of the observed vector sequence for a background (filler or "garbage") model. The background model is trained on speakers other than the target, uttering general text-independent utterances (text-independent tasks) or the phrase of user T (text-dependent tasks). In other words, $p(O \mid \lambda_B)$ represents a dynamic threshold [2], which is sensitive to variations in $O$ from trial to trial.

The main problem with the above normalization term is how to construct a good background model $\lambda_B$. Rather than averaging a group of speakers into one "wide" model, it may be better to construct several models from speakers who are close to the claimed speaker in the feature space (these are called the "cohort").
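The decision rule (1) and the background-normalized score (3) can be illustrated with a minimal Python sketch. The function names, the log-likelihood values, and the threshold are hypothetical and chosen only for illustration; accepting on a tie (score equal to the threshold) is also an assumption, since (1) does not specify that case. The example shows the "dynamic threshold" effect: when a change in recording conditions lowers the target and background log-likelihoods by the same amount, the raw score of (2) shifts but the normalized score does not.

```python
def background_normalized_score(loglik_target, loglik_background):
    """Normalized score as in (3): log p(O|target) - log p(O|background)."""
    return loglik_target - loglik_background


def decide(score, threshold):
    """Decision rule as in (1). Accepting on equality is an assumption."""
    return "accept" if score >= threshold else "reject"


# Hypothetical log-likelihoods of one utterance under clean conditions.
ll_target_clean = -4210.0
ll_background_clean = -4275.0

# Hypothetical noisy recording: both log-likelihoods drop by the same amount.
ll_target_noisy = ll_target_clean - 300.0
ll_background_noisy = ll_background_clean - 300.0

score_clean = background_normalized_score(ll_target_clean, ll_background_clean)
score_noisy = background_normalized_score(ll_target_noisy, ll_background_noisy)

threshold = 50.0  # hypothetical operating point
print(score_clean, decide(score_clean, threshold))  # 65.0 accept
print(score_noisy, decide(score_noisy, threshold))  # 65.0 accept
```

In practice the background likelihood rarely shifts by exactly the same amount as the target likelihood, so the normalization reduces score variation rather than removing it.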
In the cohort approach, the normalization term is estimated only from a group of speakers, $C_T$, whose models are somehow determined to be most "competitive" with the model of the target (claimed) speaker T.

2.1. Cohort Normalized Scores

Several score-normalization techniques may be considered. Perhaps the most intuitive one is normalization with the “closest” impostor model:

$$ s_2(O) = \log p(O \mid \lambda_T) - \max_{c \in C_T} \log p(O \mid \lambda_c) \tag{4} $$

EUROSPEECH 2003 - GENEVA 1
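The closest-impostor normalization of (4) can be sketched in a few lines of Python, reading (4) as the target log-likelihood minus the largest log-likelihood among the cohort models. The function name and the numeric values are hypothetical, and the cohort log-likelihoods are assumed to have been computed already by scoring the same utterance against each cohort model.

```python
def closest_impostor_score(loglik_target, cohort_logliks):
    """Score as in (4): target log-likelihood minus the best cohort log-likelihood.

    cohort_logliks holds log p(O|model c) for every model c in the cohort set.
    """
    if not cohort_logliks:
        raise ValueError("the cohort set must contain at least one model")
    return loglik_target - max(cohort_logliks)


# Hypothetical log-likelihoods for one utterance and a cohort of three models.
ll_target = -4210.0
ll_cohort = [-4290.0, -4255.0, -4310.0]

print(closest_impostor_score(ll_target, ll_cohort))  # 45.0
```

Only the best-matching cohort model affects this score, so adding a distant impostor to the cohort set leaves the result unchanged.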