Improving robustness in open set speaker identification by shallow source modelling

M. Zamalloa 1,2, L.J. Rodríguez 1, M. Peñagarikano 1, G. Bordel 1, J.P. Uribe 2
(1) GTTS, Departamento de Electricidad y Electrónica, Universidad del País Vasco
(2) Ikerlan – Centro de Investigaciones Tecnológicas
mzamalloa001@ikasle.ehu.es

Abstract

Open set speaker identification consists of deciding whether an input utterance corresponds to a target speaker or to an impostor. The most likely among a set of target speakers is hypothesized and verified. Speaker verification is performed by comparing the likelihood score of the most likely speaker model to the likelihood score of an impostor model, and then applying a suitable threshold. The most common approach to modelling impostors is the Universal Background Model (UBM). For the UBM to be effective, it must be estimated from a large number of speakers. However, it is not always possible to gather enough data to estimate a robust UBM, and the verification performance may degrade if impostors, or whatever other sources generate the input signals, are not suitably modelled by the UBM. In this paper, a simple approach is proposed which estimates a shallow source model (SSM) based on the input utterance, and then uses this SSM to normalize the speaker score. Though the SSM does not outperform the UBM, the combination of both models improves the recognition performance and drastically increases the robustness to signals not covered by the UBM.

1. Introduction

Closed-set speaker identification can be easily performed by first training acoustic models for a set of target speakers and then selecting the most likely speaker for each input utterance. But open-set speaker identification involves speaker verification, that is, deciding whether the input utterance was actually produced by the most likely speaker or by an impostor.
This task may arise in smart non-intrusive environments which must be permanently aware of the potential users, reacting in different ways, with different allowed functionalities, depending on the detected user. If an impostor is detected, the smart environment may automatically block its functionalities or alert the system supervisor. Another interesting application is speaker tracking in broadcast news: the audio signal is segmented into homogeneous sections (usually speaker turns), which must be automatically labelled either with the name of a target speaker or with the name of a default category corresponding to unknown speakers and other sources (music, noise, etc.).

Whatever the application, speaker data are available for a set of target speakers, and speaker models can be trained on them. Though speaker characteristics are reflected at many levels (acoustic, phonetic, phonological, prosodic, syntactic or even pragmatic), and all of them may help the identification task [1], most systems take into account only the physiological information conveyed by the acoustic parameters, and use an acoustic model to gather the statistics of the power spectrum specific to each speaker. Once the acoustic models $\lambda_s$ are estimated for the set of speakers $s = 1, \dots, S$, each input utterance $X$, which consists of a sequence of acoustic vectors $X = \{x_1, x_2, \dots, x_T\}$, is classified by selecting the most likely speaker $\hat{s}$. Applying Bayes' rule, assuming that all the speakers have equal prior probabilities and that the acoustic observations are independent, and taking logarithms, it follows that:

$$
\hat{s} = \arg\max_{s=1,\dots,S} P(\lambda_s \mid X)
        = \arg\max_{s=1,\dots,S} P(X \mid \lambda_s)\, P(\lambda_s)
        = \arg\max_{s=1,\dots,S} \log P(X \mid \lambda_s)
        = \arg\max_{s=1,\dots,S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_s)
\tag{1}
$$

The acoustic pdf $p(x \mid \lambda)$ is usually implemented by a Gaussian Mixture Model (GMM) [2].
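The closed-set decision in Eq. (1) can be illustrated with a minimal sketch, assuming diagonal-covariance GMMs whose parameters are already estimated. The function names (`gmm_frame_loglik`, `identify_speaker`), the toy model parameters and the synthetic frames are hypothetical and are not taken from the paper; speaker indices are 0-based here.

```python
import numpy as np

def gmm_frame_loglik(X, weights, means, variances):
    """Per-frame log p(x_t | lambda) for a diagonal-covariance GMM.

    X: (T, D) acoustic vectors; weights: (M,); means, variances: (M, D).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                          # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    a = np.log(weights) + log_comp
    # Log-sum-exp over mixture components, for numerical stability.
    a_max = a.max(axis=1, keepdims=True)
    return a_max[:, 0] + np.log(np.exp(a - a_max).sum(axis=1))

def identify_speaker(X, models):
    """Eq. (1): pick the speaker maximizing sum_t log p(x_t | lambda_s)."""
    scores = np.array([gmm_frame_loglik(X, *m).sum() for m in models])
    return int(np.argmax(scores)), scores

# Toy example (assumed values): two speakers, two components each, D = 2.
models = [
    (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    (np.array([0.5, 0.5]), np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2))),
]
rng = np.random.default_rng(0)
X = rng.normal(loc=5.5, scale=1.0, size=(50, 2))  # frames close to the second model
s_hat, scores = identify_speaker(X, models)
```

Equal priors are assumed, as in the derivation, so the prior term drops out and only the summed frame log-likelihoods are compared.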
Once the most likely speaker $\hat{s}$ is determined, verification may be done by comparing the average log-likelihood score $L(X \mid \lambda_{\hat{s}}) = \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid \lambda_{\hat{s}})$ to a speaker-dependent threshold $\tau(\hat{s})$. The normalizing term $1/T$ is needed so that a length-independent threshold can be applied. But the likelihood score depends not only on the speaker but also on many non-speaker utterance-specific variations, so thresholding the raw score is not sufficient, even if the thresholds are speaker-dependent.

To compensate for the effect of non-speaker utterance-specific variability and, simultaneously, to allow a speaker-independent threshold $\tau$ to be applied, speaker scores are normalized by the likelihood score of an impostor model $\lambda_{\hat{s},I}$:

$$
\Lambda(X, \hat{s}) = L(X \mid \lambda_{\hat{s}}) - L(X \mid \lambda_{\hat{s},I})
\tag{2}
$$

In the framework of an open set speaker identification task, the input utterance $X$ is assigned the label $\hat{s}$ if $\Lambda(X, \hat{s}) > \tau$; otherwise, $X$ is taken as an impostor utterance. The decision threshold $\tau$ can be heuristically adjusted to trade off false acceptance errors against false rejection errors. In this case, a false acceptance error corresponds to accepting an impostor as a target speaker, and false rejection errors correspond either to taking a target speaker as an impostor or to taking a target speaker A as the target speaker B (in brief, false rejection errors correspond to missing target speakers).

Various alternatives have been proposed in the literature to define a suitable model for impostors $\lambda_{s,I}$. A possible solution consists of using a cohort of background speakers [3]. Background speakers are, in fact, known speakers selected according to a given criterion of closeness, remoteness, competitiveness or the like, with regard to the target speaker. A speaker model is estimated for each background speaker, so that the likelihood

Odyssey 2008: The Speaker and Language Recognition Workshop, Stellenbosch, South Africa, January 21-24, 2008. ISCA Archive, http://www.isca-speech.org/archive