Robust DNN-Based Speech Enhancement with Limited Training Data
Robert Rehr and Timo Gerkmann
Signal Processing (SP), Department of Informatics, Universität Hamburg, Germany
Email: {robert.rehr,timo.gerkmann}@uni-hamburg.de
Abstract
In conventional speech enhancement, statistical models for
speech and noise are used to derive clean speech estimators.
The parameters of the models are estimated blindly from the
noisy observation using carefully designed algorithms. These
algorithms generalize well to unseen acoustic conditions, but
are unable to reduce highly non-stationary noise types. This
shortcoming motivated the usage of machine-learning-based
(ML-based) algorithms, in particular deep neural networks
(DNNs). However, if only limited training data are available, the noise reduction performance in unseen acoustic conditions suffers. In this paper, motivated by conventional speech enhancement, we propose to use the a priori and a posteriori signal-to-noise ratios (SNRs) as input features for DNN-based speech enhancement systems. Instrumental measures show that the proposed
features increase the robustness in unknown noise types even
if only limited training data are available.
1 Introduction
Speech plays a central role in many applications of personal electronic devices, e.g., in hearing aids, mobile phones
and voice-controlled personal assistants. In noisy environ-
ments, the speech signal captured by the device’s microphones
may be corrupted by undesired background noise. Noise de-
grades the quality and potentially also the intelligibility of
speech. Further, noise deteriorates the performance of au-
tomatic speech recognition systems. To satisfy the demand for high-quality speech communication, enhancement algo-
rithms are utilized to reduce the detrimental effects of noise.
In this paper, single-channel speech enhancement algorithms
are considered. Such algorithms can be used to enhance noisy
speech signals captured by a single microphone and can also
be used to improve the output of spatial filtering approaches.
Single-channel speech enhancement has been a research
topic for several decades [1]–[5]. Many algorithms operate in the short-time Fourier transform (STFT) domain, where time-frequency coefficients dominated by noise are attenuated. Conventional approaches assume the complex coef-
ficients of speech and noise to follow a known distribution
which is used to analytically derive statistically optimal esti-
mators [1], [6], [7]. Such estimators depend on the parameters
of the employed distributions which include the speech power
spectral density (PSD) and the noise PSD. The PSDs are es-
timated blindly from the noisy observation using specifically
designed algorithms [1], [8]–[10]. In this paper, we refer to
these conventional enhancement algorithms as non-machine-
learning-based (non-ML-based) enhancement schemes.
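As an illustrative example of such an estimator, consider the well-known Wiener filter (the frequency-bin index k and frame index ℓ used here are our notation, not taken from this paper): the clean speech coefficient is estimated from the noisy STFT coefficient Y_{k,\ell} as

\hat{S}_{k,\ell} = \frac{\sigma^2_{s,k,\ell}}{\sigma^2_{s,k,\ell} + \sigma^2_{n,k,\ell}} \, Y_{k,\ell},

where \sigma^2_{s,k,\ell} and \sigma^2_{n,k,\ell} denote the speech and noise PSDs. The gain is fully determined by the ratio of the two PSDs, which is why accurate blind PSD estimates are essential for conventional enhancement.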
The shortcomings of non-ML-based algorithms, namely their inability to suppress highly non-stationary noise types such as transients and the speech distortions they introduce, have motivated the use of machine-learning-
based (ML-based) methods. Instead of estimating proper-
ties of speech and noise blindly from the noisy observations,
machine-learning (ML) algorithms leverage training exam-
ples to learn these properties prior to the processing. For
this, various ML algorithms have been employed, e.g., Gaus-
sian mixture models (GMMs) and hidden Markov models
(HMMs) [3], non-negative matrix factorization (NMF) [4] and
deep neural networks (DNNs) [5], [11]. Deep learning techniques in particular show the potential to improve speech enhancement in highly non-stationary noise. However, their robustness in unseen acoustic conditions is still under discussion [12]–[14].
The generalization of DNN-based algorithms generally improves with the number and diversity of the training
examples. However, for specific acoustic conditions, only
limited training data may be available or obtaining additional
training data may be expensive, e.g., in robotics. In this
paper, we propose a novel method to improve the general-
ization of DNN-based speech enhancement algorithms for
unseen acoustic conditions if only limited training data are
available. The proposed approach combines ML-based meth-
ods with non-ML-based noise and speech PSD estimators.
Despite their shortcomings in highly non-stationary noise, non-ML-based algorithms have proven to be robust in many different acoustic environments. Further, these algorithms are invariant to changes of
the input level. Hence, we propose to use estimates of the
a priori signal-to-noise ratio (SNR), i.e., the ratio of speech
and noise PSD, and the a posteriori SNR, i.e., the ratio of
noisy input periodogram and noise PSD, as input features.
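Written out (with the notation of the Wiener filter example above), the two proposed features are

\xi_{k,\ell} = \frac{\sigma^2_{s,k,\ell}}{\sigma^2_{n,k,\ell}} \quad \text{(a priori SNR)}, \qquad \gamma_{k,\ell} = \frac{|Y_{k,\ell}|^2}{\sigma^2_{n,k,\ell}} \quad \text{(a posteriori SNR)}.

Since both are ratios, a common scaling of the input cancels in numerator and denominator, which yields the level invariance noted above.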
These features are motivated by conventional non-ML clean
speech estimators, which are often functions of these two
quantities. Further, these features have previously been used to train data-driven gain functions [15], [16]; in contrast to recent DNN-based approaches, however, neighbouring frequency bands were assumed to be independent and no temporal context was considered. In contrast to the previously proposed noise-
aware training (NAT) [5], [17] and its dynamic variants [18],
[19], where the estimated noise PSD is appended to the input
features, the proposed features are normalized by the noise
PSD estimate. This is somewhat related to [20], where an
ideal ratio mask (IRM) [21] is predicted by a DNN and then used as an input to a subsequent enhancement network. In
our work, we exploit the generalization of non-ML-based ap-
proaches such that training another DNN to predict the IRM
is avoided. A similar enhancement structure has been used
in [22], but the noise and the speech PSD have been estimated
using ML-based algorithms. We compare the proposed fea-
tures to NAT-based features using instrumental measures. When training data are limited, Perceptual Evaluation of Speech
Quality (PESQ) [23] indicates that the signal quality of the
enhanced signals is higher for the proposed features than for
NAT-based features in unseen noise conditions.
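To make the feature construction concrete, the following sketch illustrates how such input features could be computed. It is a minimal example, not the authors' implementation: the decision-directed approach of Ephraim and Malah is used here as one possible non-ML a priori SNR estimator, and the smoothing constant, the SNR floor, and the log compression are our assumptions.

```python
import numpy as np

def snr_features(noisy_stft, noise_psd, alpha=0.98, xi_min=10 ** (-25 / 10)):
    """Sketch: a priori / a posteriori SNR input features for a DNN.

    noisy_stft : complex STFT of the noisy signal, shape (frames, bins).
    noise_psd  : blind noise PSD estimate of the same shape, e.g. from
                 minimum statistics or an SPP-based tracker (any
                 conventional estimator could be plugged in here).
    alpha, xi_min : decision-directed smoothing constant and a priori
                 SNR floor; typical textbook values, assumed here.
    """
    periodogram = np.abs(noisy_stft) ** 2
    # a posteriori SNR: noisy periodogram over noise PSD (floored for the log)
    gamma = np.maximum(periodogram / noise_psd, 1e-6)
    xi = np.empty_like(gamma)
    s_hat_prev = np.zeros(noisy_stft.shape[1])  # |S_hat|^2 of previous frame
    for l in range(gamma.shape[0]):
        # decision-directed a priori SNR estimate (Ephraim/Malah)
        xi_dd = alpha * s_hat_prev / noise_psd[l] \
            + (1 - alpha) * np.maximum(gamma[l] - 1.0, 0.0)
        xi[l] = np.maximum(xi_dd, xi_min)
        gain = xi[l] / (1.0 + xi[l])            # Wiener gain
        s_hat_prev = gain ** 2 * periodogram[l]
    # log-compressed features; stacking context frames would follow here
    return np.log(xi), np.log(gamma)
```

A NAT-style front end would instead concatenate the log noisy periodogram with the log noise PSD estimate, rather than normalizing by it.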
In Section 2, we recapitulate non-ML-based speech en-
hancement. Section 3 introduces the ML-based enhancement
scheme, recapitulates the previously used noise-aware fea-
tures and presents the proposed features. The evaluation and
the results are shown in Section 4.
2 Conventional Speech Enhancement
In this section, conventional non-ML-based speech enhance-
ment is considered and a brief overview of the used speech