SPEAKER VERIFICATION WITHOUT BACKGROUND SPEAKER MODELS

Chun-Nan Hsu, Hau-Chung Yu, Bo-Hou Yang
Institute of Information Science, Academia Sinica, Nankang 115, Taipei City, Taiwan
chunnan@iis.sinica.edu.tw

ABSTRACT

Speaker verification is the problem of verifying whether a given utterance was spoken by a claimed authorized speaker. This problem is important because an accurate speaker verification system can be applied to many security applications. In this paper, we present a new algorithm for speaker verification called OSCILLO. By applying tolerance interval analysis from statistics, OSCILLO can verify a speaker’s ID without background speaker models. This greatly reduces the space requirement of the system and the time for both training and verification. Experimental results show that OSCILLO achieves error rates comparable to or better than those of the GMM-based system with background speaker models on three benchmark databases: TCC-300, TIMIT and NIST 2000.

1. INTRODUCTION

Speaker recognition is the process of automatically recognizing who is speaking on the basis of information obtained from speech waves. This technique makes it possible to verify the identity of a person accessing a system, that is, to implement voice-based access control in various applications. Speaker recognition problems fall into two categories [7]: speaker identification and speaker verification. Speaker identification is a closed-set classification problem, in which the system assigns an unknown speaker’s identity to one of a set of known speakers. Speaker verification systems, in contrast, must accurately reject unknown speakers and therefore must deal with an open-set classification problem [4].

Previously, Reynolds et al. proposed a speaker verification system using Gaussian mixture models (GMM) [6][8].
Their systems require a set of background speaker models, which are constructed from a large speech database of speakers with a variety of demographic backgrounds that match the population of potential imposter speakers. In many real-world situations, however, it may not be feasible to obtain such a database. Moreover, there is no systematic approach to the construction of background speaker models. The approaches proposed in [6] and [8] are heuristic and depend heavily on the quality of the speech database.

The need for background speaker models also makes many security applications of speaker verification infeasible. For example, it would be highly desirable commercially for portable devices such as cellular phones and PDAs to be equipped with a speaker verification system so that they can be accessed only by their owners. But it is not cost-effective for the device to carry a large database for constructing background speaker models. The other option is to let users submit their speech samples to some site where training will be performed, but that will deter users with privacy concerns.

In this paper, we present a new algorithm for speaker verification without background speaker models. With this new algorithm, a speaker verification system can be trained using only the speech samples from the claimed speaker. This greatly reduces the space requirement and the time for both training and verification, and thus makes many security applications feasible.

This paper is organized as follows: Section 2 reviews the Gaussian mixture model for speaker recognition; Section 3 introduces tolerance interval analysis; Section 4 describes the baseline speaker verification system, a GMM-based algorithm proposed in previous work by Reynolds et al. [6]; Section 5 presents our algorithm OSCILLO; Section 6 reports the experimental results that compare OSCILLO and the baseline system; the last section contains our conclusion.

2.
GAUSSIAN MIXTURE MODEL

Mixture models are a type of density model comprising a number of component functions, usually Gaussian. These component functions are combined to provide a multimodal density. The distribution of feature vectors extracted from a speaker’s speech is modeled by a Gaussian mixture density. This method has proved to be one of the most successful approaches to closed-set, text-independent speaker identification [5]. A Gaussian mixture density is a weighted sum of M component densities, and is given by the equation

$$p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\vec{x}) \qquad (1)$$
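
As an illustration only, equation (1) can be evaluated numerically with a short sketch such as the one below. It assumes that each component density b_i is a Gaussian with a diagonal covariance matrix and that the mixture weights p_i sum to one; the function names and the toy parameter values are hypothetical and are not taken from the systems discussed in this paper.

```python
import math


def gaussian_diag_pdf(x, mean, var):
    """Density of a Gaussian with diagonal covariance, evaluated at x.

    Assumption: diagonal covariance, so the density factorizes over
    the dimensions of the feature vector.
    """
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)


def gmm_density(x, weights, means, variances):
    """Equation (1): weighted sum of M component densities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_diag_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    )


# Toy two-component, one-dimensional model (made-up values).
weights = [0.5, 0.5]
means = [[0.0], [2.0]]
variances = [[1.0], [1.0]]
density = gmm_density([0.0], weights, means, variances)
```

Accumulating the per-dimension terms in the log domain before exponentiating is a common precaution against numerical underflow for high-dimensional feature vectors; a practical system would typically also sum log-likelihoods over all frames of an utterance.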