SPEECH ENHANCEMENT BY PERCEPTUAL FILTER WITH SEQUENTIAL NOISE PARAMETER ESTIMATION

Te-Won Lee and Kaisheng Yao
Institute for Neural Computation, Univ. of California at San Diego
9500 Gilman Drive, La Jolla, CA 92093-0523
tewon@ucsd.edu, kyao@ucsd.edu

ABSTRACT

We report work on speech enhancement that combines sequential noise estimation and perceptual filtering. The sequential estimation employs an extension of a sequential EM-type algorithm, in which the statistics of clean speech are modeled by hidden Markov models (HMMs) and the noise is assumed to be Gaussian distributed with a time-varying mean vector (the noise parameter) to be estimated. The estimation process uses a non-linear function that relates speech statistics, noise, and the noisy observation. With the estimated noise parameter, subtraction-type speech enhancement algorithms can be extended to non-stationary environments. In particular, a perceptual filter with frequency masking is constructed around a tradeoff between noise reduction and speech distortion, taking into account the sensitivity of speech recognition systems to speech distortion. Our experiments on speech enhancement and speech recognition in non-stationary noise confirmed that this approach is promising, improving performance compared to alternative speech enhancement algorithms.

1. INTRODUCTION

The goal of speech enhancement is to recover the original speech signal from noisy observations, and it has been studied extensively over the past decades [1]. Traditional methods [2][3] usually assume that the statistics of the contaminating noise are known to the enhancement system. In the simplest case, the noise statistics can be modeled by a single Gaussian density, which assumes that they are constant. More detailed modeling of the noise statistics can be achieved with Gaussian mixture models (GMMs). This assumption requires a sufficient amount of noise data to learn the noise statistics.
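The non-linear function relating clean speech, noise, and the noisy observation is not spelled out in this excerpt; a commonly used form in the log-spectral domain (additive signals in the power domain) is y = x + log(1 + exp(n - x)), i.e. log(exp(x) + exp(n)). The sketch below illustrates that form only; the symbol names are ours and the paper's exact function may differ.

```python
import numpy as np

def log_add_mismatch(x_log, n_log):
    """Log-spectral mismatch function for additive noise:
    y = x + log(1 + exp(n - x)) = log(exp(x) + exp(n)).
    x_log: clean-speech log power spectrum; n_log: noise log power spectrum."""
    x_log = np.asarray(x_log, dtype=float)
    n_log = np.asarray(n_log, dtype=float)
    # log1p keeps the computation stable when n_log - x_log is very negative.
    return x_log + np.log1p(np.exp(n_log - x_log))

# Two bins: one where noise is far below speech, one where they are equal.
x = np.array([10.0, 10.0])
n = np.array([0.0, 10.0])
y = log_add_mismatch(x, n)
```

When the noise is far below the speech level, the observation is essentially the clean speech (first bin); when they are equal, the observation sits log(2) above either (second bin).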
Unfortunately, this assumption may not hold in realistic environments, where the noise statistics can differ from those seen during training, limiting the performance of these methods.

More recently, researchers have begun to investigate speech enhancement in time-varying noisy environments. The proposed methods assume a parametric function relating speech and background noise, and apply sequential methods, e.g., sequential Monte Carlo [4] and Bayesian inference [5]. These methods usually use HMMs to model clean speech statistics, together with a simple noise model whose parameters are estimated from the noisy speech.

This paper presents a method for speech enhancement within the above framework and makes the following contributions. First, for sequential noise parameter estimation it is beneficial to have algorithms with a fast convergence rate and low computational requirements. Since the noise parameter estimation process involves inferring hidden speech mixtures/states, (deterministic or stochastic) EM-type algorithms have to be used. This paper applies the sequential Kullback proximal algorithm (SKP) [6], a sequential version of the Kullback proximal algorithm [7]. The Kullback proximal algorithm can achieve a faster convergence rate than the standard EM algorithm. Moreover, the computational requirement of the SKP algorithm is much lower than that of some alternative methods [4][5].

The second contribution is a subtraction-type speech enhancement algorithm that makes use of the estimated noise statistics. The algorithm is designed around a tradeoff between noise reduction and speech distortion, since both influence speech recognition performance. The tradeoff may be achieved by retaining a certain amount of residual noise in the enhanced speech signals. We suggest employing human auditory properties [8] in the design of the subtraction-type algorithm.
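To illustrate the flavor of sequential noise parameter estimation (though not the SKP algorithm itself, whose update rules are not given in this excerpt), the following sketch updates a time-varying noise mean recursively, moving the estimate toward each observation weighted by a posterior probability that the frame is noise-dominated. The function name, step size, and posterior weights are our illustrative assumptions.

```python
import numpy as np

def sequential_noise_mean(frames, p_noise, step=0.1, mu0=0.0):
    """Recursive (EM-flavored) update of a time-varying noise mean vector.
    frames:  (T, D) array of observed feature vectors.
    p_noise: length-T posterior probabilities that each frame is noise.
    Each step moves the estimate toward the observation, weighted by
    the posterior and a fixed step size (illustrative, not the SKP rule)."""
    frames = np.asarray(frames, dtype=float)
    mu = np.full(frames.shape[1], mu0, dtype=float)
    trajectory = []
    for t, y in enumerate(frames):
        mu = mu + step * p_noise[t] * (y - mu)   # innovation-weighted update
        trajectory.append(mu.copy())
    return np.array(trajectory)

# Stationary noise at level 2.0, always classified as noise:
# the estimate converges toward 2.0 over the frames.
frames = np.full((50, 4), 2.0)
p_noise = np.ones(50)
traj = sequential_noise_mean(frames, p_noise, step=0.2)
```

The recursive form is what makes the estimator sequential: each frame refines the parameter without revisiting past data, which is what keeps the computational requirement low.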
Although human auditory properties have been applied in some previous methods, e.g., [3], those methods may not be able to handle time-varying noise, since they assume noise stationarity. With the sequential noise estimation presented in this paper, such previous work [3] can be extended to time-varying environments. We conducted experiments on speech enhancement and speech recognition in time-varying noisy environments to verify the algorithm and validate its applicability.

2. SPEECH ENHANCEMENT WITH SEQUENTIAL NOISE PARAMETER ESTIMATION

2.1. Time-varying Linear Filtering

Assume speech and noise are uncorrelated. In this context, the power spectrum of the input noisy signal at filter bin k at frame t, S^l_Y(t,k), can be considered as the summation of the power spectra of the clean speech signal and the noise, i.e.,

    S^l_Y(t,k) = S^l_X(t,k) + S^l_N(t,k)    (1)

where the superscript l denotes the linear spectral domain.

Subtraction-type enhancement methods are equivalent to attenuating the above spectrum with a time-varying coefficient alpha(t,k), i.e., S^l_Xhat(t,k) = alpha(t,k) S^l_Y(t,k). We consider two choices for speech enhancement because of their simplicity.

1. The Wiener filter constructs the coefficient as

    alpha(t,k) = |S^l_Xhat(t,k)| / (|S^l_Xhat(t,k)| + |S^l_Nhat(t,k)|),

where the operator |.| means absolute value, and

I - 693 0-7803-8484-9/04/$20.00 ©2004 IEEE ICASSP 2004
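The Wiener-type attenuation above can be sketched directly: the gain in each bin is the ratio of the estimated speech power to the estimated noisy power, so it approaches 1 where speech dominates and 0 where noise dominates. A minimal sketch (function name and spectral floor are our own):

```python
import numpy as np

def wiener_gain(speech_psd, noise_psd, floor=1e-10):
    """Per-bin Wiener attenuation coefficient:
    alpha(k) = S_x(k) / (S_x(k) + S_n(k)),
    applied to the noisy spectrum as X_hat = alpha * Y.
    A small floor avoids division by zero in silent bins."""
    speech_psd = np.maximum(np.asarray(speech_psd, dtype=float), floor)
    noise_psd = np.maximum(np.asarray(noise_psd, dtype=float), floor)
    return speech_psd / (speech_psd + noise_psd)

# Bin 0: speech and noise equally strong; bin 1: speech 100x stronger.
g = wiener_gain([1.0, 100.0], [1.0, 1.0])
```

At 0 dB SNR the gain is 0.5, and it rises toward 1 as the local SNR grows, which is the behavior the tradeoff between noise reduction and speech distortion exploits.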