LEARNING SPEECH FEATURES IN THE PRESENCE OF NOISE: SPARSE CONVOLUTIVE ROBUST NON-NEGATIVE MATRIX FACTORIZATION

Ruairí de Fréin and Scott T. Rickard
Complex and Adaptive Systems Laboratory
University College Dublin, Ireland

ABSTRACT

We introduce a non-negative matrix factorization technique which learns speech features with temporal extent in the presence of non-stationary noise. Our proposed technique, Sparse Convolutive Robust Non-negative Matrix Factorization, is robust in the presence of noise due to our explicit treatment of noise as an interfering source in the factorization. We derive multiplicative update rules using the alpha-divergence objective. We show that our proposed method yields superior performance to sparse convolutive non-negative matrix factorization in a feature-learning task on noisy data, and comparable results to dedicated speech enhancement techniques.

Index Terms— Spectral factorization, Speech enhancement

1. INTRODUCTION

This paper describes a single-channel method for enhancing speech corrupted by additive broadband, stationary or non-stationary noise. The first goal is to improve the quality of the speech by removing background noise without introducing artifacts, for example musical noise. The second goal is to learn speaker-specific features given the noisy conditions. This is achieved by balancing the trade-off between speech intelligibility (measured by Perceptual Evaluation of Speech Quality (PESQ) scores) and reduction in noise (measured by signal-to-noise ratio (SNR)). Sparse Convolutive Robust Non-negative Matrix Factorization (SCRNMF) combines the feature-learning and enhancement tasks in one step. Everyday use of mobile phones in noisy environments, such as a busy city street, the carriage of a moving train, or a windy beach, keeps speech enhancement developments to the fore in speech signal processing research.
Given a mixture in the continuous time domain,

v(t) = s(t) + n(t),    (1)

of speech, s(t), and noise, n(t), spectral subtraction techniques [1] assume that the magnitude spectrogram of the corrupted speech is equal to the sum of the magnitude spectrogram of the speech and the magnitude spectrogram of the noise. Similarly, assumptions of disjoint orthogonality [2], independence of occurrence and the sparsity of speech in time-frequency [3], or the log-max approximation [4] are invoked in source separation tasks. Spectral subtraction assumes that mixtures of sources in time-frequency are approximately non-overlapping and that phase cross-terms are zero. The assumption that

|V(n, k)| = |S(n, k)| + |N(n, k)|,    (2)

is a reasonable approximation when the speech and noise are uncorrelated and for short-term spectra, where V(n, k), S(n, k) and N(n, k) are the discrete Short Time Fourier Transform (STFT) of the corrupted speech, the clean speech and the noise respectively. Enhancement is performed using the relationship

|Ŝ(n, k)| = |V(n, k)| − |N̂(n, k)|,    (3)

where stationary noise is removed by subtracting an estimate of the noise magnitude spectrogram, N̂(n, k), from the mixture. The authors in [1] assume that noise can be estimated a priori or during periods of silence, and that the estimate remains stationary. The original phase is used to re-synthesize the enhanced speech,

Ŝ(n, k) = |Ŝ(n, k)| e^{j arg V(n, k)}.    (4)

Variations of the spectral subtraction method for noise reduction have been proposed [5] for dealing with its limitations, specifically the introduction of musical noise or musical-tone artifacts. The cross-terms involving phase differences between the noisy and clean signals are estimated in [5], improving perceptual quality but at the expense of SNR.

(Supported by Science Foundation Ireland Grant No. 05/Y12/1677.)
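The spectral subtraction pipeline of Eqs. (2)–(4) can be sketched in a few lines. This is a minimal illustration, not the paper's method: it assumes the noise estimate N̂ is obtained by averaging an initial noise-only segment (the `noise_frames` parameter and the zero-flooring of the subtracted magnitude are implementation choices, not from the source).

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(v, fs, noise_frames=10, nperseg=512):
    """Minimal spectral-subtraction sketch following Eqs. (2)-(4).

    Assumes the first `noise_frames` STFT frames of `v` are noise-only,
    and that the noise is stationary over the whole mixture.
    """
    _, _, V = stft(v, fs=fs, nperseg=nperseg)
    # Stationary noise magnitude estimate, averaged over noise-only frames
    N_hat = np.abs(V[:, :noise_frames]).mean(axis=1, keepdims=True)
    # Magnitude subtraction (Eq. 3), floored at zero to stay non-negative
    S_mag = np.maximum(np.abs(V) - N_hat, 0.0)
    # Re-synthesize with the noisy phase (Eq. 4)
    S = S_mag * np.exp(1j * np.angle(V))
    _, s_hat = istft(S, fs=fs, nperseg=nperseg)
    return s_hat[:len(v)]
```

The hard zero floor in the subtraction step is exactly what introduces the isolated residual peaks perceived as musical noise, which the variations in [5] are designed to suppress.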
Approaches such as Wiener filtering require an accurate estimate of the corrupting noise, garnered from a period of noise training or using a speech detector, to achieve good performance. Non-negative Matrix Factorization (NMF) has been applied to speech de-noising in the magnitude spectrogram domain in [6] and [7], where the noise is non-stationary. Prior information is built into the factorization using a training step to estimate a basis indicative of the noise and speech sources in [6] and [7]. The authors in [6] leverage the co-occurrence statistics of these bases to remove noise from the mixture by adding a side-term to the objective which constrains the occurrence of each atom in the reconstruction based on pre-computed source statistics. Accuracy is dependent upon the length of the training data.

The novelty of SCRNMF, in comparison with commonly adopted speech enhancement methods, is that it performs speech feature learning which is robust in noise without prior knowledge of the speaker or the noise. We introduce the SCRNMF algorithm in Section 4. SCRNMF learns speech basis functions with temporal extent. We introduce Non-negative Matrix Factorization and its extensions in Section 2. We compare the performance of SCRNMF with regularized NMF and speech enhancement methods and illustrate the gains in performance in Section 5.

2. NON-NEGATIVE MATRIX FACTORIZATION

Non-negative matrix factorization, a low-rank factorization technique introduced in [8], decomposes a magnitude spectrogram, V ∈ ℝ≥0^{M×N}, into non-negative¹ spectral shapes, W ∈ ℝ≥0^{M×R},

¹An element-wise non-negative matrix, A, of dimension X × Y is denoted by A ∈ ℝ≥0^{X×Y}.
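As a point of reference for the factorization just introduced, a plain NMF with the standard multiplicative updates of [8] can be sketched as follows. This uses the generalized Kullback-Leibler divergence for concreteness; it is not the paper's alpha-divergence SCRNMF update, and the rank `R` and iteration count are illustrative choices.

```python
import numpy as np

def nmf_kl(V, R, n_iter=300, eps=1e-9, seed=0):
    """Basic NMF via multiplicative updates for the generalized
    KL divergence [8]. A sketch of plain NMF, not SCRNMF.

    V : (M, N) non-negative magnitude spectrogram
    R : number of spectral shapes (basis atoms)
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, R)) + eps   # spectral shapes, W in R>=0^{M x R}
    H = rng.random((R, N)) + eps   # activations,     H in R>=0^{R x N}
    for _ in range(n_iter):
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H
```

Because the updates are multiplicative and V is non-negative, W and H remain element-wise non-negative throughout, which is what allows the columns of W to be read directly as spectral shapes.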