Robust Speaker Recognition Using Spectro-Temporal Autoregressive Models

Sri Harish Mallidi 1, Sriram Ganapathy 2, Hynek Hermansky 1
1 Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, USA.
2 IBM T.J. Watson Research Center, Yorktown Heights, NY, USA.
{mallidi,hynek}@jhu.edu, ganapath@us.ibm.com

Abstract

Speaker recognition in noisy environments is challenging when there is a mismatch between the data used for enrollment and verification. In this paper, we propose a robust feature extraction scheme based on spectro-temporal modulation filtering using two-dimensional (2-D) autoregressive (AR) models. The first step is AR modeling of the sub-band temporal envelopes, obtained by applying linear prediction to the sub-band discrete cosine transform (DCT) components. These sub-band envelopes are stacked together and used for a second AR modeling step, in which the spectral envelope across the sub-bands is approximated and cepstral features are derived for speaker recognition. The use of AR models emphasizes the high-energy regions, which are relatively well preserved in the presence of noise. The degree of modulation filtering is controlled by the AR model order parameter. Experiments are performed using noisy versions of the NIST 2010 speaker recognition evaluation (SRE) data with a state-of-the-art speaker recognition system. In these experiments, the proposed features provide significant improvements compared to the baseline features (relative improvements of 20% in terms of equal error rate (EER) and 35% in terms of miss rate at 10% false alarm).

Index Terms: Rate-Scale Filtering, Autoregressive Modeling, Speaker Recognition, Robust Feature Extraction.

1. Introduction

Speech technology works reasonably well in matched conditions but degrades rapidly when there is an acoustic mismatch between the training and test conditions.
Although multi-condition training can improve performance [1], realistic scenarios benefit from robustness that does not require training data from the target acoustic environment. In this paper, we develop a feature extraction scheme which attempts to address robustness in noisy and reverberant environments.

In the past, various feature processing techniques like spectral subtraction [2], Wiener filtering [3] and missing data reconstruction [4] have been developed for noisy speech recognition applications. Feature compensation techniques have also been used for speaker verification systems (feature warping [5], RASTA processing [6] and cepstral mean subtraction (CMS) [7]). With noise or reverberation, the low-energy valleys of the speech signal have the worst signal-to-noise ratio (SNR), while the high-energy regions are robust and could be well modeled [9]. In general, an autoregressive (AR) modeling approach represents high-energy regions with good modeling accuracy [10, 11]. AR modeling of signal spectra is widely used for feature extraction of speech [12]. AR modeling of Hilbert envelopes [16, 17] has been used with the similar goal of preserving peaks in sub-band temporal envelopes and has been successfully applied to speaker verification [27]. 2-D AR modeling was originally proposed for speech recognition by alternating the AR models between spectral and temporal domains [14].

This research was funded by the Defense Advanced Research Projects Agency (DARPA) under Contract No. D10PC20015 and the Office of the Director of National Intelligence (ODNI). The authors would also like to acknowledge Brno University of Technology, Xinhui Zhou and Daniel Garcia-Romero for software fragments.

In this paper, we extend our previous approach on two-dimensional AR modeling [15] with a modulation filtering framework.
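As background for the peak-preserving property claimed above, the following is a minimal numpy sketch (all function names and parameter values here are illustrative, not from the paper) of autocorrelation-method linear prediction via the Levinson-Durbin recursion; the resulting AR spectrum locks onto the high-energy spectral peaks of the signal:

```python
import numpy as np

def autocorr_lpc(x, order):
    """Autocorrelation-method linear prediction via the Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coefficient
        a[1:i + 1] = a[1:i + 1] + k * a[:i][::-1]   # update prediction filter
        err *= 1.0 - k * k                          # remaining residual energy
    return a, err

# A sinusoid in weak noise: a low-order AR spectrum peaks at the tone frequency,
# modeling the high-energy spectral region accurately.
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 0.1 * np.arange(4096)) + 0.01 * rng.standard_normal(4096)
a, err = autocorr_lpc(x, order=4)
ar_spectrum = err / np.abs(np.fft.fft(a, 2048)[:1024]) ** 2
peak_freq = np.argmax(ar_spectrum) / 2048  # close to 0.1 cycles/sample
```

The same fitting behavior holds whether the prediction is applied across frequency (conventional LPC) or across the DCT components of a sub-band signal, which is the basis of the temporal AR stage described below.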
Long segments of the input speech signal are decomposed into sub-bands, and linear prediction is applied to the sub-band discrete cosine transform (DCT) components to derive Hilbert envelopes [16]. The sub-band envelopes are stacked together to form a time-frequency description, and a second AR model is applied across the sub-bands for each short-term frame (25 ms with a shift of 10 ms). The output of the second AR model is converted to cepstral coefficients and used for speaker recognition. Modifying either of the AR models, the time-domain one or the frequency-domain one, represents in effect a rate-scale (time-frequency) modulation filtering [18]. The time-domain AR model performs the rate filtering and the frequency-domain AR model performs the scale filtering, similar to the approaches discussed in [19].

Experiments are performed on the core conditions of the NIST 2010 SRE data [20] with various artificially added noises and reverberation. In these experiments, the proposed features provide considerable improvements compared to the conventional features.

The rest of the paper is organized as follows. Sec. 2 details the proposed feature extraction scheme using 2-D AR models. This is followed by a discussion of various rate-scale feature streams derived from this framework (Sec. 3.1). Sec. 4 describes the experiments on the NIST 2010 SRE. In Sec. 5, we conclude with a brief discussion of the proposed front-end.

2. Feature Extraction

The block schematic of the proposed feature extraction is shown in Fig. 1. Long segments of the input speech signal (10 s non-overlapping windows) are transformed using a discrete cosine transform [27]. The full-band DCT signal is windowed into a set of 96 overlapping linear sub-bands in the frequency range of 125-3700 Hz. In each sub-band, linear prediction is applied to the sub-band DCT components to estimate an all-pole representation of the Hilbert envelope [16, 17]. This constitutes the temporal AR modeling stage.
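The temporal AR modeling stage can be sketched as follows. This is a minimal single-signal illustration, not the full front-end (the actual system operates on 96 sub-bands of 10 s windows, and the naive DCT matrix below would be replaced by a fast transform); function names and the model order are hypothetical:

```python
import numpy as np

def lpc(x, order):
    """Autocorrelation-method linear prediction (Levinson-Durbin)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err
        a[1:i + 1] = a[1:i + 1] + k * a[:i][::-1]
        err *= 1.0 - k * k
    return a, err

def fdlp_envelope(x, order):
    """All-pole estimate of the Hilbert envelope: LP applied to the DCT of x."""
    N = len(x)
    # DCT-II computed directly for clarity (a real front-end would use a fast DCT)
    k, n = np.arange(N)[:, None], np.arange(N)
    c = np.cos(np.pi * k * (n + 0.5) / N) @ x
    a, g = lpc(c, order)
    # The AR spectrum of the DCT sequence approximates the temporal envelope of x
    H = np.fft.fft(a, 2 * N)[:N]
    return g / np.abs(H) ** 2

# A windowed tone burst: the all-pole envelope peaks near the burst center,
# the high-energy region that the AR model preserves.
x = np.zeros(512)
x[150:250] = np.sin(2 * np.pi * 0.25 * np.arange(100)) * np.hanning(100)
env = fdlp_envelope(x, order=20)
```

Because the envelope is an all-pole fit, increasing the model order tracks faster amplitude modulations while a low order retains only the slow ones, which is how the AR order acts as a modulation (rate) filter in this framework.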
The FDLP envelopes from the various sub-bands are stacked together to obtain a two-dimensional representation as shown in Fig. 1. The sub-band envelopes are integrated in short-term frames

Copyright 2013 ISCA. INTERSPEECH 2013, 25-29 August 2013, Lyon, France. 10.21437/Interspeech.2013-692