SPEECH ENHANCEMENT BY PERCEPTUAL FILTER WITH SEQUENTIAL NOISE
PARAMETER ESTIMATION
Te-Won Lee and Kaisheng Yao
Institute for Neural Computation, Univ. of California at San Diego
9500 Gilman Drive, La Jolla, CA 92093-0523
tewon@ucsd.edu, kyao@ucsd.edu
ABSTRACT
We report work on speech enhancement that combines sequential
noise estimation with perceptual filtering. The sequential estimation
employs an extension of the sequential EM-type algorithm, in which
the statistics of clean speech are modeled by hidden Markov models
(HMMs) and the noise is assumed to be Gaussian distributed with a
time-varying mean vector (the noise parameter) to be estimated. The
estimation process uses a non-linear function relating speech
statistics, noise, and the noisy observation. With the estimated noise
parameter, subtraction-type speech enhancement algorithms can be
extended to non-stationary environments. In particular, we construct
a perceptual filter with frequency masking that trades off noise
reduction against speech distortion, taking into account the
sensitivity of speech recognition systems to speech distortion.
Experiments on speech enhancement and speech recognition in
non-stationary noise confirm that this approach improves performance
compared to alternative speech enhancement algorithms.
1. INTRODUCTION
The goal of speech enhancement is to recover the original speech
signal from noisy observations; this problem has been studied
extensively over the past decades [1]. Traditional methods [2][3]
usually assume that the statistics of the contaminating noise are
known to the enhancement system. In the simplest case, the noise can
be modeled by a single Gaussian density, which assumes that the noise
statistics are constant. More detailed modeling of the noise
statistics may use Gaussian mixture models (GMMs). Either choice
requires a sufficient amount of noise data to learn the noise
statistics. Unfortunately, this assumption may not hold in realistic
environments, where the noise statistics may differ from those seen
during training, limiting the performance of these methods.
More recently, researchers have begun to investigate speech
enhancement in time-varying noisy environments. Proposed methods
assume a parametric function relating speech and background noise and
apply sequential estimation, e.g., sequential Monte Carlo [4] and
Bayesian inference [5]. These methods usually use an HMM to model
clean speech statistics together with a simple noise model whose
parameters are estimated from the noisy speech.
This paper presents a method for speech enhancement within the above
framework and makes the following contributions. First, for
sequential noise parameter estimation, it is beneficial to have
algorithms with a fast convergence rate and low computational
requirements. Since the noise parameter estimation process involves
estimating hidden speech mixtures/states, (deterministic or
stochastic) EM-type algorithms have to be used.
This paper applies the sequential Kullback proximal algorithm
(SKP) [6], a sequential version of the Kullback proximal
algorithm [7], which can achieve a faster convergence rate than the
standard EM algorithm. Moreover, the computational requirement of the
SKP algorithm is much lower than that of some alternative
methods [4][5]. The second contribution is a
subtraction-type speech enhancement algorithm that makes use of the
estimated noise statistics. The algorithm is designed around a
tradeoff between noise reduction and speech distortion, as both
influence speech recognition performance. The tradeoff may be
achieved by retaining a certain amount of residual noise in the
enhanced speech signals. We propose employing human auditory
properties [8] in the design of the subtraction-type algorithm.
Although auditory properties have been applied in some previous
methods, e.g., [3], those methods may not handle time-varying noise
because of their underlying assumption of noise stationarity. With
the sequential noise estimation presented here, such previous
work [3] can be extended to time-varying environments.
We conducted experiments on speech enhancement and speech recognition
in time-varying noisy environments to verify the algorithm and
validate its applicability.
2. SPEECH ENHANCEMENT WITH SEQUENTIAL
NOISE PARAMETER ESTIMATION
2.1. Time-varying Linear Filtering
Assume speech and noise are uncorrelated. In this context, the power
spectrum of the input noisy signal at filter bin $m$
($m = 1, \ldots, M$), denoted $Y_t^l(m)$, can be considered as the
summation of the power spectra of the clean speech signal and the
noise, i.e.,
$$Y_t^l(m) = X_t^l(m) + N_t^l(m) \quad (1)$$
where the superscript $l$ denotes the linear spectral domain.
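The additivity in (1) rests on the uncorrelatedness assumption: the cross-term between speech and noise averages out over frequency bins. A minimal numpy sketch (the signals and seed are illustrative toy choices, not from the paper) checks this on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096
speech = np.sin(2 * np.pi * 0.05 * np.arange(n))  # toy "speech": a pure tone
noise = 0.5 * rng.standard_normal(n)              # uncorrelated white noise

def power_spectrum(x):
    """Power spectrum |FFT|^2 over the positive-frequency bins."""
    return np.abs(np.fft.rfft(x)) ** 2

P_speech = power_spectrum(speech)
P_noise = power_spectrum(noise)
P_noisy = power_spectrum(speech + noise)

# Total power of the mixture is close to the sum of the individual powers,
# because the speech-noise cross-term averages out for uncorrelated signals.
rel_err = abs(P_noisy.sum() - (P_speech + P_noise).sum()) / (P_speech + P_noise).sum()
print(rel_err)  # small relative error
```

Per-bin equality only holds in expectation; summing over bins makes the residual cross-term visibly small.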
Subtraction-type enhancement methods are equivalent to attenuating
the above spectrum with a time-varying coefficient $G_t(m)$, i.e.,
$\hat{X}_t^l(m) = G_t(m) \, Y_t^l(m)$.
We consider two choices for speech enhancement because of their
simplicity.
1. The Wiener filter constructs the coefficient as
$G_t(m) = |Y_t^l(m) - N_t^l(m)| \, / \, Y_t^l(m)$, where the operator $|\cdot|$ denotes absolute value, and
I - 693 0-7803-8484-9/04/$20.00 ©2004 IEEE ICASSP 2004
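Such a Wiener-type gain can be sketched as follows; the function name, flooring constant, and toy spectra are illustrative assumptions, not from the paper. The floor retains a little residual noise rather than zeroing bins, in the spirit of the tradeoff discussed above, and with a sequentially updated noise estimate the same gain extends to non-stationary noise:

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    """Per-bin Wiener-type gain G = |Y - N| / Y, clipped to [floor, 1].
    Flooring keeps some residual noise instead of zeroing bins, which
    helps limit musical-noise artifacts."""
    raw = np.abs(noisy_psd - noise_psd) / np.maximum(noisy_psd, 1e-12)
    return np.clip(raw, floor, 1.0)

# Attenuate the noisy power spectrum bin by bin: X_hat = G * Y
noisy = np.array([4.0, 2.0, 1.0, 0.5])      # toy noisy power spectrum Y
noise_est = np.array([1.0, 1.0, 1.0, 1.0])  # estimated noise power N
enhanced = wiener_gain(noisy, noise_est) * noisy
print(enhanced)  # → [3.    1.    0.001 0.5  ]
```

Bins dominated by speech power keep most of their energy, while the noise-only bin is attenuated down to the floor.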