A data-driven approach to optimizing spectral speech enhancement methods for various error criteria

Jan Erkelens *, Jesper Jensen, Richard Heusdens

Department of Mediamatics, Information and Communication Theory Group, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands

Received 31 January 2006; received in revised form 26 June 2006; accepted 27 June 2006

Abstract

Gain functions for spectral noise suppression have been derived in the literature for some error criteria and statistical models. These gain functions are only optimal when the statistical model is correct and the speech and noise spectral variances are known. Unfortunately, the speech distributions are unknown and can at best be determined conditionally on the estimated spectral variance. We show that the “decision-directed” approach for speech spectral variance estimation can have an important bias at low SNRs, which generally leads to too much speech suppression. To correct for such estimation inaccuracies and adapt to the unknown speech statistics, we propose a general optimization procedure, with two gain functions applied in parallel. A conventional algorithm is run in the background and is used for a priori SNR estimation only. For the final reconstruction a different gain function is used, optimized for a wide range of signal-to-noise ratios. The gain function used for the reconstruction is trained on a speech database by minimizing a relevant error criterion. The procedure is illustrated for several error criteria. The method compares favorably to current state-of-the-art methods, and needs less smoothing in the decision-directed spectral variance estimator.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Speech enhancement; Spectral distortion measures; Speech model

1. Introduction

Single-microphone speech enhancement is important for many applications (Benesty et al., 2005).
Techniques in the short-time Fourier domain are often used, because they are fast, perform well, and the statistical modeling in the frequency domain is simple. Minimum mean-square error (MMSE) estimators of the spectral amplitudes (Ephraim and Malah, 1984) or log spectral amplitudes (Ephraim and Malah, 1985), based on the assumption of a Rayleigh distribution for the amplitudes, are commonly used. More general distribution assumptions have been made as well (Lotter and Vary, 2005), and estimators based on super-Gaussian distributions, such as Laplace and Gamma distributions, for the real and imaginary parts of the Fourier coefficients have also been proposed (Martin, 2005a). The latter methods are data-driven in the sense that optimal estimators are derived for distribution models that fit observed speech distributions. The analytical derivation of estimators is only possible for some error criteria and certain statistical models. Porter and Boll (1984) used a data-driven method to calculate estimators directly from clean speech. In their work no estimator of spectral variance was used and the estimators were adapted to global speech distributions only.

Perhaps the most commonly used estimator of speech spectral variance is the “decision-directed” variance estimator (Ephraim and Malah, 1984). It greatly reduces an annoying artefact of spectral enhancement methods, called “musical noise” (Cappé, 1994). The decision-directed estimator combines the estimated amplitude of the previous analysis frame with the noisy amplitude of the current frame into one estimator of the spectral variance. Although it reduces the musical noise, it may lead to smoothing of

0167-6393/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2006.06.012
* Corresponding author. Tel.: +31 15 2785859; fax: +31 15 2781843.
E-mail address: j.s.erkelens@tudelft.nl (J. Erkelens).
URL: http://www-ict.ewi.tudelft.nl/ (J. Erkelens).
www.elsevier.com/locate/specom Speech Communication 49 (2007) 530–541