Abstract—In this paper, we present a study on speaker verification in time-varying noisy environments. A novel feature extraction process suitable for the suppression of time-varying noise is compared with a fine-tuned spectral subtraction noise suppression front-end. Both techniques are employed to derive enhanced Mel-Frequency Cepstral Coefficients (MFCCs) for a text-independent speaker verification baseline system which participated in the 2002 NIST Speaker Recognition Evaluation. The novel feature extraction technique is based on approximating the clean speech spectral magnitude, as well as the noise spectral magnitude, with a mixture of Gaussian pdfs using the Expectation-Maximization (EM) algorithm. Subsequently, the Bayesian inference framework is applied to the degraded spectral coefficients and, by employing Minimum Mean Square Error (MMSE) estimation, a closed-form solution for the spectral magnitude estimation task is derived. The estimated spectral magnitude is finally incorporated in the MFCC framework. A comparative study of the proposed technique in a variety of real-world noise types demonstrates a significant performance gain compared to the baseline speech features and to the spectral subtraction enhancement method.

Index Terms—feature extraction, signal reconstruction, speech enhancement, speaker recognition

I. INTRODUCTION

Although Automatic Speaker Verification (ASV) has reached the state of launching commercial products, the real-world environment is still a challenge, due to the acoustic mismatch between training and operational conditions. Generally speaking, ASV methods build speaker models based on available speech corpora gathered in controlled conditions.
In short, contemporary ASV systems are composed of a feature extraction stage, which aims at extracting the speaker's characteristics while alleviating linguistic sources of variability, and a classification stage, which identifies the feature vector with the class of a certain speaker (Fig. 1). The extraction level of current ASV systems converts the input speech signal into a series of multi-dimensional vectors, each corresponding to a short segment of the acoustic speech input. The resulting feature vector makes use of information from all spectrum bands; therefore, any distortion induced in any part of the spectrum spreads to all features forming the vector. The classification stage, which is based on the probability density function of the acoustic vectors, is seriously confused in the case of impaired features.

Robust ASV methods often include as a front-end noise suppression techniques acting directly on the speech signal [1]-[2]. A second approach is to extract speech features that are less sensitive to noise, or to apply feature space transformations that reduce variability due to noise at the parametric stage [3]-[5]. Last but not least, a combination of models for both noise and clean speech is used at the user modelling stage [6] to accommodate speaker recognition in noisy environments. The current literature mainly addresses simulated noises or stationary real-world noise sources. However, as we move from laboratory settings to real-world applications, it becomes necessary to develop more sophisticated techniques to face the complexity of real adverse environments.

This work was supported by the "Infotainment management with Speech Interaction via Remote microphones and telephone interfaces" (INSPIRE) project (IST-2001-32746). The authors are with the Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, 26500 Rio-Patras, Greece; phone: +30 2610 991722; e-mail: tganchev@wcl.ee.upatras.gr.
Our study is focused on telephone-driven applications, where a single acquisition channel is available. The main goal is to reduce the effect on speaker verification performance of the acoustic mismatch between testing and training conditions caused by real time-varying noisy environments. In this work we integrate a model-based spectral enhancement stage into the MFCC feature extraction process. The spectral enhancement stage incorporates into a Bayesian formulation a-priori information about the long-term pdf of each spectral band of an ensemble of clean recordings, as well as a model of the noise built from sample recordings of the operational environment. A mixture of Gaussians is employed to represent the magnitude of each spectral band of an ensemble of high-quality speech (three minutes of phonetically balanced speech from speakers of both genders were found sufficient). The descriptive parameters of each mixture are derived from the observed spectral bands of the clean data by employing the EM algorithm. Subsequently, we incorporate a Gaussian mixture model for the background noise and derive the descriptive statistics of its mixtures by the EM algorithm. In the experiments we study the effect of speech distortion on speaker verification performance for a number of SNRs, ranging from +20 dB to -10 dB, for real-world noise types such as factory noise and passing-by aircraft noise. The verification performance is compared with baseline MFCC features without any pre-processing and with MFCCs enhanced by spectral subtraction.

II. DESCRIPTION OF THE WCL-1 SYSTEM

The text-independent speaker verification system described in this section participated in the 2002 NIST Speaker Recognition Evaluation.

Noise-Source Modelling for Robust Speaker Verification in Adverse Environments
Todor Ganchev, Member, IEEE, Ilyas Potamitis, Nikos Fakotakis, Member, IEEE
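The per-band modelling described in Section I can be illustrated with a small sketch: plain EM fits a one-dimensional Gaussian mixture to the observed magnitudes of one spectral band, and a posterior-mean (MMSE) combiner then estimates the clean value from a noisy observation. This is a simplified stand-in for the paper's derivation: it assumes additive Gaussian noise of known variance per band, and the function names, component count `K`, and iteration count are illustrative choices.

```python
import numpy as np

def fit_gmm_1d(x, K=4, iters=50, seed=0):
    """Fit a K-component 1-D Gaussian mixture to samples x with plain EM."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=K)               # initial means: random samples
    var = np.full(K, np.var(x))              # initial variances: global variance
    w = np.full(K, 1.0 / K)                  # initial weights: uniform
    for _ in range(iters):
        # E-step: responsibilities of each component for each sample
        d = x[:, None] - mu[None, :]
        logp = -0.5 * d**2 / var - 0.5 * np.log(2 * np.pi * var) + np.log(w)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu[None, :])**2).sum(axis=0) / nk + 1e-8
    return w, mu, var

def mmse_estimate(y, w, mu, var, noise_var):
    """Posterior-mean (MMSE) estimate of the clean value behind noisy y,
    under the fitted GMM prior and additive Gaussian noise of variance
    noise_var (an assumption of this sketch)."""
    # Evidence of y under each component: N(y; mu_k, var_k + noise_var)
    s = var + noise_var
    lik = w * np.exp(-0.5 * (y - mu)**2 / s) / np.sqrt(2 * np.pi * s)
    post = lik / lik.sum()
    # Per-component conditional mean, mixed by the posterior weights
    cond_mean = (noise_var * mu + var * y) / s
    return float((post * cond_mean).sum())
```

In the full system, one such mixture would be fitted per spectral band of the clean ensemble (and, analogously, for the noise recordings), and the estimated magnitudes would then feed the standard MFCC computation.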