Blind Determination of the Signal to Noise Ratio of Speech Signals Based on Estimation Combination of Multiple Features

Russell Ondusko, Matthew Marbach, Andrew McClellan, Ravi P. Ramachandran, Linda M. Head
Rowan University
Correspondence: ravi@rowan.edu

Mark C. Huggins
Lockheed-Martin
Mark.Huggins@rl.af.mil

Brett Y. Smolenski
Research Associates for Defense Conversion
Brett.Smolenski@rl.af.mil

Abstract— A blind approach for estimating the signal to noise ratio (SNR) of a speech signal corrupted by additive noise is proposed. The method is based on a pattern recognition paradigm using various linear predictive based features, a vector quantizer classifier and estimation combination. Blind SNR estimation is very useful in speaker identification systems in which a confidence metric is determined along with the speaker identity. The confidence metric is partially based on the mismatch between the training and testing conditions of the speaker identification system, and SNR estimation is very important in evaluating the degree of this mismatch. The aim is to correctly estimate SNR values from 0 to 30 dB, a range that is both practical and crucial for speaker identification systems. Additive white Gaussian noise and pink noise are investigated. The best feature for both white and pink noise is the vector of reflection coefficients, which achieves an average SNR estimation error of 1.6 dB and 1.85 dB for white and pink noise, respectively. Combining the estimates of 4 features lowers the error for white noise to 1.46 dB and for pink noise to 1.69 dB.

I. INTRODUCTION

Consider a speech signal corrupted by additive noise that is statistically independent of the signal. This noisy signal is characterized by a signal to noise ratio (SNR) calculated over the entire duration of the signal. In this paper, a pattern recognition approach using various linear predictive (LP) [1] derived features is used to blindly estimate the SNR of the noisy speech signal.
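The global SNR referred to above is the ratio of signal power to noise power over the full duration of the signal. The following is a minimal sketch of this computation; the helper name `global_snr_db` and the noise-scaling example are our own illustration, not part of the paper.

```python
import numpy as np

def global_snr_db(clean, noise):
    """SNR in dB over the whole signal duration (hypothetical helper
    illustrating the global SNR definition used in the paper)."""
    signal_power = np.mean(np.asarray(clean, float) ** 2)
    noise_power = np.mean(np.asarray(noise, float) ** 2)
    return 10.0 * np.log10(signal_power / noise_power)

# Example: scale additive white Gaussian noise to hit a target SNR of 10 dB.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)   # stand-in for one second of speech
noise = rng.standard_normal(16000)
target_db = 10.0
scale = np.sqrt(np.mean(clean**2) / (np.mean(noise**2) * 10**(target_db / 10)))
noisy = clean + scale * noise        # noisy speech at the target SNR
```

The same scaling is how noisy training material at each distinct SNR level (0 to 30 dB) can be synthesized from clean speech.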
Blind estimation of the SNR is very useful in closed set speaker identification systems. The training of a speaker identification system involves the configuration of M models, each representing a different speaker. During closed set testing, the features of an utterance are compared to the M models to render a decision of the speaker identity as being one of the M speakers [2][3]. Recent research has developed techniques to calculate a confidence metric to accompany the decision of the speaker identity [4][5]. The confidence metric is calculated based on the mismatch between training and testing conditions, the amount of training and testing data, and the number of speakers (the value of M). As M increases, there is usually more model overlap. The greater the difference between the SNR of the training speech and that of the testing speech, the greater the mismatch between the two conditions and the lower the confidence metric. An automatic and blind method of SNR estimation for the training and testing speech is therefore an integral part of computing the confidence metric of a speaker identification system.

The method proposed for blind SNR estimation is based on a pattern recognition paradigm, just as in speaker identification. Features based on LP analysis that are not robust to noise are highly useful candidates for SNR estimation, since they show clear differences across noise levels. The overall system consists of four components, namely, (1) linear predictive (LP) analysis, (2) feature extraction for ensuring SNR discrimination, (3) a vector quantizer (VQ) classifier and decision logic for computing the SNR estimate and (4) combination of the SNR estimates of the different features to get a final estimate. During training, a VQ codebook is trained for each distinct SNR value using feature vectors obtained from noisy speech corresponding to that particular SNR. During testing, the input to the system is a noisy speech signal with an unknown SNR.
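The per-SNR codebook training and the nearest-codebook decision can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the codebook size, iteration count and the plain k-means trainer are our own assumptions, and the distance is Euclidean.

```python
import numpy as np

def train_codebook(features, k=4, iters=20, seed=0):
    """Tiny k-means VQ codebook trainer (illustrative; k and iters
    are arbitrary choices, not values from the paper)."""
    rng = np.random.default_rng(seed)
    code = features[rng.choice(len(features), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each feature vector to its nearest codevector.
        d = np.linalg.norm(features[:, None, :] - code[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                code[j] = features[labels == j].mean(axis=0)
    return code

def estimate_snr(test_features, codebooks):
    """Accumulate, per SNR codebook, the distance of every test vector
    to its nearest codevector; return the SNR (in dB) whose codebook
    yields the smallest overall distance."""
    best_snr, best_dist = None, np.inf
    for snr, code in codebooks.items():
        d = np.linalg.norm(test_features[:, None, :] - code[None, :, :], axis=2)
        total = d.min(axis=1).sum()
        if total < best_dist:
            best_snr, best_dist = snr, total
    return best_snr
```

One such classifier would be trained per feature (LSF, REFL, LAR, CEP, ...), and the resulting per-feature SNR estimates combined into the final estimate.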
After LP analysis and feature extraction, the set of feature vectors is passed through each VQ codebook to get an overall distance for each codebook. Based on these distances, the output is an estimated SNR value. A VQ classifier is trained separately for each feature, leading to an SNR estimate for each feature. The different LP based features are compared with respect to the average absolute error between the actual and estimated SNR. The features considered [1][6][7] include the line spectral frequencies (LSF), reflection coefficients (REFL), log area ratios (LAR), linear predictive cepstrum (CEP), adaptive component weighted cepstrum (ACW) and the postfilter cepstrum (PFL). The SNR estimates of the individual features are combined to get an even better estimate, in that the average absolute error is further reduced.

II. FEATURE EXTRACTION

Linear predictive analysis results in a stable all-pole model 1/A(z) of order p, where

A(z) = 1 - \sum_{n=1}^{p} a(n) z^{-n}    (1)

The autocorrelation method of LP analysis gives rise to the predictor coefficients a(n) and the REFL feature refl(n) for n = 1 to p. The LAR feature is found as

lar(n) = \log\left( \frac{1 - refl(n)}{1 + refl(n)} \right)    (2)

for n = 1 to p. The components lsf(n) of the LSF feature are the angles (between 0 and π) of the alternating unit circle roots of F(z) and G(z) [1], where

F(z) = A(z) + z^{-(p+1)} A(z^{-1})
G(z) = A(z) - z^{-(p+1)} A(z^{-1})    (3)

The predictor coefficients a(n) are converted to the LP cepstrum clp(n) (n ≥ 1) by an efficient recursive relation [1]:

clp(n) = a(n) + \sum_{i=1}^{n-1} \left( \frac{i}{n} \right) clp(i) a(n-i)    (4)

Since clp(n) is of infinite duration, the CEP feature vector of dimension p consists of the components clp(1) to clp(p).

1-4244-0387-1/06/$20.00 © 2006 IEEE    APCCAS 2006    1897
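Equations (1), (2) and (4) can be realized with the Levinson-Durbin recursion, which yields the predictor and reflection coefficients together. The sketch below is our own minimal illustration (windowing and pre-emphasis are omitted, and the function name `lp_features` is hypothetical); the sign conventions follow Eq. (1), and note that some texts define reflection coefficients with the opposite sign.

```python
import numpy as np

def lp_features(x, p=10):
    """Autocorrelation-method LP analysis of order p: returns the
    predictor a(1..p), reflection refl(1..p), log area ratio lar(1..p)
    and LP cepstrum clp(1..p) features, per Eqs. (1), (2) and (4)."""
    x = np.asarray(x, float)
    # Autocorrelation lags r(0..p).
    r = np.array([np.dot(x[:len(x) - m], x[m:]) for m in range(p + 1)])
    a = np.zeros(p + 1)          # a[0] unused; A(z) = 1 - sum a(n) z^-n
    refl = np.zeros(p + 1)
    err = r[0]                   # prediction error energy
    for n in range(1, p + 1):    # Levinson-Durbin recursion
        k = (r[n] - np.dot(a[1:n], r[n-1:0:-1])) / err
        refl[n] = k
        a_new = a.copy()
        a_new[n] = k
        a_new[1:n] = a[1:n] - k * a[n-1:0:-1]
        a = a_new
        err *= (1.0 - k * k)
    lar = np.log((1.0 - refl[1:]) / (1.0 + refl[1:]))        # Eq. (2)
    clp = np.zeros(p + 1)
    for n in range(1, p + 1):                                # Eq. (4)
        clp[n] = a[n] + sum((i / n) * clp[i] * a[n - i] for i in range(1, n))
    return a[1:], refl[1:], lar, clp[1:]
```

For a stable model all refl(n) lie strictly inside (-1, 1), so the LAR logarithm in Eq. (2) is always well defined.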