Robust DNN-Based Speech Enhancement with Limited Training Data

Robert Rehr and Timo Gerkmann
Signal Processing (SP), Department of Informatics, Universität Hamburg, Germany
Email: {robert.rehr,timo.gerkmann}@uni-hamburg.de

Abstract

In conventional speech enhancement, statistical models for speech and noise are used to derive clean speech estimators. The parameters of these models are estimated blindly from the noisy observation using carefully designed algorithms. These algorithms generalize well to unseen acoustic conditions, but are unable to reduce highly non-stationary noise types. This shortcoming has motivated the use of machine-learning-based (ML-based) algorithms, in particular deep neural networks (DNNs). However, if only limited training data are available, the noise reduction performance in unseen acoustic conditions suffers. In this paper, motivated by conventional speech enhancement, we propose to use the a priori and a posteriori signal-to-noise ratios (SNRs) as input features for DNN-based speech enhancement systems. Instrumental measures show that the proposed features increase the robustness in unknown noise types even if only limited training data are available.

1 Introduction

Speech plays a central role in the applications of many personal electronic devices, e.g., in hearing aids, mobile phones and voice-controlled personal assistants. In noisy environments, the speech signal captured by the device's microphones may be corrupted by undesired background noise. Noise degrades the quality and potentially also the intelligibility of speech. Further, noise deteriorates the performance of automatic speech recognition systems. To satisfy the demand for high-quality speech communication, enhancement algorithms are utilized to reduce the detrimental effects of noise. In this paper, single-channel speech enhancement algorithms are considered. Such algorithms can be used to enhance noisy speech signals captured by a single microphone and can also be used to improve the output of spatial filtering approaches.

Single-channel speech enhancement has been a research topic for several decades [1]–[5]. Many algorithms operate in the short-time Fourier transform (STFT) domain, where the time-frequency coefficients that are dominated by noise are attenuated. Conventional approaches assume the complex coefficients of speech and noise to follow a known distribution, which is used to analytically derive statistically optimal estimators [1], [6], [7]. Such estimators depend on the parameters of the employed distributions, which include the speech power spectral density (PSD) and the noise PSD. The PSDs are estimated blindly from the noisy observation using specifically designed algorithms [1], [8]–[10]. In this paper, we refer to these conventional enhancement algorithms as non-machine-learning-based (non-ML-based) enhancement schemes.

The shortcomings of non-ML-based algorithms, namely their inability to suppress highly non-stationary noise types, such as transients, and the speech distortions they introduce, have motivated the use of machine-learning-based (ML-based) methods. Instead of estimating properties of speech and noise blindly from the noisy observations, machine-learning (ML) algorithms leverage training examples to learn these properties prior to the processing. For this, various ML algorithms have been employed, e.g., Gaussian mixture models (GMMs) and hidden Markov models (HMMs) [3], non-negative matrix factorization (NMF) [4] and deep neural networks (DNNs) [5], [11]. Especially deep learning techniques show potential to improve speech enhancement in highly non-stationary noise. However, their robustness in unseen acoustic conditions is still under discussion [12]–[14]. The generalization of DNN-based algorithms generally improves with the number and diversity of the training examples. However, for specific acoustic conditions, only limited training data may be available or obtaining additional training data may be expensive, e.g., in robotics.

In this paper, we propose a novel method to improve the generalization of DNN-based speech enhancement algorithms for unseen acoustic conditions if only limited training data are available. The proposed approach combines ML-based methods with non-ML-based noise and speech PSD estimators. Despite their shortcomings in highly non-stationary noise, non-ML-based algorithms have proven to be robust in many different acoustic environments. Further, these algorithms are invariant to changes of the input level. Hence, we propose to use estimates of the a priori signal-to-noise ratio (SNR), i.e., the ratio of the speech PSD and the noise PSD, and the a posteriori SNR, i.e., the ratio of the noisy input periodogram and the noise PSD, as input features. These features are motivated by conventional non-ML clean speech estimators, which are often functions of these two quantities.
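For reference, the two quantities can be written in STFT-domain notation; the symbols below are chosen for illustration and are not the paper's own definitions. With Y_{k,ℓ} the noisy STFT coefficient in frequency bin k and frame ℓ, and Φ_{S,k,ℓ} and Φ_{N,k,ℓ} the speech and noise PSDs, we have

\[
  \xi_{k,\ell} = \frac{\Phi_{S,k,\ell}}{\Phi_{N,k,\ell}},
  \qquad
  \gamma_{k,\ell} = \frac{|Y_{k,\ell}|^2}{\Phi_{N,k,\ell}},
\]

where ξ denotes the a priori SNR and γ the a posteriori SNR; for the proposed features, both are formed from blind non-ML estimates of the speech and noise PSDs. A well-known example of an estimator that is a function of these quantities is the Wiener filter, whose spectral gain is ξ/(1+ξ).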
Further, these features have previously been used to train data-driven gain functions [15], [16], but in contrast to recent DNN-based approaches, neighbouring frequency bands were assumed to be independent and no context was considered. In contrast to the previously proposed noise-aware training (NAT) [5], [17] and its dynamic variants [18], [19], where the estimated noise PSD is appended to the input features, the proposed features are normalized by the noise PSD estimate. This is somewhat related to [20], where an ideal ratio mask (IRM) [21] is predicted by a DNN and used as an input to a subsequent enhancement network. In our work, we exploit the generalization of non-ML-based approaches such that training another DNN to predict the IRM is avoided. A similar enhancement structure has been used in [22], but there the noise and speech PSDs were estimated using ML-based algorithms. We compare the proposed features to NAT-based features using instrumental measures. In the case of limited training data, the Perceptual Evaluation of Speech Quality (PESQ) [23] indicates that the signal quality of the enhanced signals is higher for the proposed features than for NAT-based features in unseen noise conditions.

In Section 2, we recapitulate non-ML-based speech enhancement. Section 3 introduces the ML-based enhancement scheme, recapitulates the previously used noise-aware features and presents the proposed features. The evaluation and the results are shown in Section 4.

2 Conventional Speech Enhancement

In this section, conventional non-ML-based speech enhancement is considered and a brief overview of the used speech and noise PSD estimators is given.
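To make the structure of such a conventional pipeline concrete, the following Python sketch (our own illustration, not the specific estimators recapitulated in this section) applies a Wiener gain driven by the a posteriori SNR and a decision-directed a priori SNR estimate; the noise PSD is crudely estimated from the first frames purely as a placeholder, whereas the paper relies on a dedicated blind noise PSD estimator, and the function name and parameter values are ours.

    import numpy as np
    from scipy.signal import stft, istft

    def wiener_enhance(noisy, fs, alpha=0.98, n_init=10):
        # STFT analysis (window length and overlap are illustrative choices).
        _, _, Y = stft(noisy, fs=fs, nperseg=512, noverlap=384)
        # Placeholder noise PSD: average over the first frames (assumed speech-free).
        noise_psd = np.mean(np.abs(Y[:, :n_init]) ** 2, axis=1) + 1e-12

        S_hat = np.zeros_like(Y)
        xi_prev = np.ones(Y.shape[0])
        for l in range(Y.shape[1]):
            gamma = np.abs(Y[:, l]) ** 2 / noise_psd              # a posteriori SNR
            # Decision-directed a priori SNR estimate (a common non-ML choice).
            xi = alpha * xi_prev + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
            gain = xi / (1.0 + xi)                                # Wiener gain
            S_hat[:, l] = gain * Y[:, l]
            xi_prev = np.abs(S_hat[:, l]) ** 2 / noise_psd
        _, enhanced = istft(S_hat, fs=fs, nperseg=512, noverlap=384)
        return enhanced

In the proposed method, it is the estimates of the a priori and a posteriori SNR that are fed to the DNN as input features, rather than being mapped directly to a spectral gain as in this sketch.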