Magnitude Spectrum Enhancement for Robust Speech Recognition
Wen-hsiang Tu and Jeih-weih Hung
Dept of Electrical Engineering, National Chi Nan University
Taiwan, Republic of China
e-mail: aero3016@ms45.hinet.net , jwhung@ncnu.edu.tw
Abstract
In this paper, an effective compensation scheme for the spectra
of speech signals is proposed in order to improve their noise
robustness. In this compensation scheme, named magnitude
spectrum enhancement (MSE), a voice activity detection (VAD)
process is first performed on the frame sequence of the utterance,
and then the magnitude spectra of non-speech frames are set to
be small while those of speech frames are amplified. In
experiments conducted on the Aurora-2 noisy digits database,
MSE achieves a relative error reduction rate of nearly 50% from
the baseline processing, which outperforms the well-known
spectral-domain speech enhancement techniques, spectral
subtraction (SS) and Wiener filtering (WF). In addition, the
proposed MSE can be integrated with cepstral-domain
robustness methods, like mean and variance normalization
(MVN) and histogram equalization (HEQ), to achieve further
improved recognition accuracy under noise-corrupted
environments.
Index Terms: speech recognition, robust speech features, speech
enhancement
1. Introduction
The environmental mismatch caused by additive noise and/or
channel distortion often degrades the performance of a speech
recognition system seriously. Various robustness techniques
have been proposed to reduce this mismatch, and one category
of them aims at obtaining robust speech features. As we know,
the mel-frequency cepstral coefficient (MFCC) is one of the
most widely used speech feature representations due to its high
recognition performance under a clean condition. However,
MFCC is not very noise-robust, and thus many robustness
techniques are applied in various domains of a noise-corrupted
speech signal when deriving MFCC. For example, the well-
known spectral subtraction (SS) [1] and Wiener filtering (WF)
[2,3] techniques are used in the linear spectral domain, and
various feature statistics normalization techniques, like cepstral
mean subtraction (CMS) [4], mean and variance normalization
(MVN) [5], MVN plus ARMA filtering (MVA) [6] and
histogram equalization (HEQ) [7] are often used in the cepstral
domain.
Beyond the MFCC features, several recent papers [8-11] have
found that properly compensating the logarithmic energy (logE)
feature can significantly improve the recognition accuracy
under noisy conditions. In our previously proposed method [10],
silence feature normalization (SFN), the high-pass filtered logE
is used as the indicator for speech/non-speech frame
classification, and then the logE features of non-speech frames
are set to be small while those of speech frames are kept nearly
unchanged. We have shown that SFN is very effective despite
its simplicity of implementation.
Partially motivated by the concept of SFN, in this paper we
propose a new approach, named magnitude spectrum
enhancement (MSE), to process the noise-corrupted signal in the
linear spectral domain, with the hope that the resulting MFCC
features can be more noise-robust. Briefly, in MSE the
magnitude spectrum of each non-speech frame is set to be small,
as in SFN, while the magnitude spectrum of each speech frame
is amplified by multiplying it by a weighting factor related to
the signal-to-noise ratio (SNR). The main purpose of MSE is to
highlight the spectral difference between the speech and
non-speech frames rather than to reconstruct the clean speech
spectrum, as SS and WF do. The experiments conducted
on the Aurora-2 digit database show that our proposed MSE can
provide a significant improvement in recognition accuracy under
various noise-corrupted environments. It performs better than SS
and WF, and it can be well integrated with cepstral-domain
processing techniques, like MVN, MVA and HEQ. The best
averaged accuracy rate obtained with the proposed method on the
Aurora-2 clean-condition training task is as high as 90.98%.
The remainder of the paper is organized as follows: Section 2
introduces the proposed MSE method. The experimental setup is
described in Section 3, and the experiment results are given and
discussed in Section 4. Finally, Section 5 contains brief
concluding remarks and future work.
2. The Proposed Magnitude Spectrum Enhancement
Method
Assume that $\{x_m[n],\ 0 \le n \le N-1\}$ is the time-domain signal
for the $m$-th frame of an utterance. Taking the $K$-point DFT of
$\{x_m[n]\}$, we obtain the spectrum for this frame as follows,
$$ X_m[k] = \left| \sum_{n=0}^{N-1} x_m[n]\, e^{-j 2\pi nk / K} \right|, \quad 0 \le k \le K/2, \quad 1 \le m \le M, \tag{1} $$
where $M$ is the number of frames in this utterance. As a result,
$X_m[k]$ represents the magnitude spectrum for the $k$-th frequency
bin of the $m$-th frame in an utterance. On the other hand, the
logarithmic energy (logE) feature of the $m$-th frame is calculated
as follows:
$$ e_m = \log\left( \sum_{n=0}^{N-1} x_m^2[n] \right), \quad 1 \le m \le M. \tag{2} $$
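The two per-frame quantities, the magnitude spectrum and the logE feature, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `frame_features`, the assumption that the frame length N does not exceed the DFT size K, and the small floor inside the logarithm are choices made here and are not taken from the paper.

```python
import numpy as np

def frame_features(frames, K):
    """Compute per-frame magnitude spectra and log energies.

    frames: array of shape (M, N), one row per frame x_m[n] (assumes N <= K).
    Returns (X, e) where X has shape (M, K // 2 + 1) and e has shape (M,).
    """
    frames = np.asarray(frames, dtype=float)
    # rfft zero-pads each frame to K points and keeps bins k = 0 .. K/2
    X = np.abs(np.fft.rfft(frames, n=K, axis=1))
    # Floor avoids log(0) on all-zero frames (an assumption, not in the paper)
    e = np.log(np.maximum(np.sum(frames ** 2, axis=1), 1e-12))
    return X, e

# Example: three constant frames of length N = 4 with a K = 8 point DFT
X, e = frame_features(np.ones((3, 4)), K=8)
```

Only the bins from 0 to K/2 are kept because the spectrum of a real-valued frame is conjugate-symmetric, so the remaining bins carry no additional magnitude information.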
Then the proposed magnitude spectrum enhancement (MSE)
approach uses the following two steps to create the new
magnitude spectrum:
Step I: Perform the process of voice activity detection (VAD):
The VAD process that discriminates speech/non-speech frames
in an utterance is based on two sources, the magnitude spectrum
in eq. (1) and the logE in eq. (2). As for the first source, since
the high-pass filtered logarithmic magnitude spectrum,
$\log X_m[k]$
, which can be viewed as the logE feature at the k
th
frequency bin, is shown to be more helpful in discriminating the
speech and non-speech portions [10], we first process the
sequence $\{\log X_m[k]\}$ with a high-pass IIR filter whose
input-output relationship is
$$ Y_m[k] = \log X_m[k] - \lambda\, Y_{m-1}[k], \quad 0 \le k \le K/2, \quad 1 \le m \le M, \tag{3} $$
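A first-order high-pass recursion of this kind, applied along the frame index separately for each frequency bin, might be sketched as follows. The sketch assumes the recursion Y_m[k] = log X_m[k] - lam * Y_{m-1}[k] with 0 < lam < 1; the value `lam=0.5`, the zero initial condition, the flooring of the magnitudes before the logarithm, and the function name `highpass_log_spectrum` are all illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def highpass_log_spectrum(X, lam=0.5, floor=1e-10):
    """Filter the log magnitude spectrum along the frame axis.

    X: magnitude spectra of shape (M, num_bins), one row per frame.
    lam: filter coefficient (placeholder value, assumed 0 < lam < 1).
    Returns Y with the same shape as X.
    """
    # Floor avoids log(0) for empty bins (an assumption)
    log_X = np.log(np.maximum(np.asarray(X, dtype=float), floor))
    Y = np.zeros_like(log_X)
    prev = np.zeros(log_X.shape[1])  # assumed initial condition: Y_0[k] = 0
    for m in range(log_X.shape[0]):
        # Subtracting the previous output puts the pole at z = -lam,
        # which attenuates slow (low-frequency) variation across frames
        Y[m] = log_X[m] - lam * prev
        prev = Y[m]
    return Y

# Example: a constant log spectrum of 1.0 in every bin and frame
Y = highpass_log_spectrum(np.full((3, 2), np.e), lam=0.5)
```

With this sign convention the gain for a constant input is 1/(1 + lam) while the gain at the highest modulation frequency is 1/(1 - lam), which is what makes the filter high-pass.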
Next, we sum up the high-pass filtered logarithmic spectrum,
$Y_m[k]$, over the entire frequency band for each frame as follows:
4586 978-1-4244-4296-6/10/$25.00 ©2010 IEEE ICASSP 2010