SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS

Hajar Momeni 1,2,*, Emanuël A. P. Habets 1 and Hamid Reza Abutalebi 2

1 International Audio Laboratories Erlangen, Germany †
2 Electrical and Computer Engineering Dept., Yazd University, Iran

h momeni@stu.yazd.ac.ir, emanuel.habets@audiolabs-erlangen.de and habutalebi@yazd.ac.ir

ABSTRACT

The speech presence probability (SPP) plays an important role in many noise reduction and noise estimation methods. The SPP is commonly computed per time and frequency in the short-time Fourier transform (STFT) domain based on the a priori speech absence probability and the a priori and a posteriori signal-to-noise ratios. Due to the STFT as well as the nature of the speech signal, there exists a correlation between subsequent time frames and neighboring frequency bands. In this work, we explicitly take these inter-frame and inter-band correlations into account when computing the SPP. The presented results demonstrate that we can increase the detection accuracy of the SPP estimator by taking a few neighboring time and frequency bins into account.

Index Terms: speech presence probability, inter-frame correlations, inter-band correlations.

1. INTRODUCTION

For many noise reduction and noise estimation methods, an estimator for the speech presence probability (SPP) in each time-frequency (TF) unit is of great interest. Clean-speech estimators, for example, are often derived under the assumption that speech is actually present. As this assumption is true neither during speech pauses nor between spectral bins of the harmonics of a voiced sound, the SPP should be taken into account [1–4]. Available noise power spectral density (PSD) estimators also make use of the SPP to decide when to update the noise PSD [5–7].
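As a point of reference, the per-bin SPP that is computed from the a priori speech absence probability and the a priori and a posteriori signal-to-noise ratios can be sketched as follows. The closed form used here is the standard one from the literature for complex Gaussian spectral coefficients; it is not the estimator proposed in this paper. The function name `conventional_spp` and the symbols `gamma`, `xi` and `q` are illustrative assumptions.

```python
import numpy as np

def conventional_spp(gamma, xi, q):
    """Per-bin a posteriori SPP under a complex Gaussian model (sketch).

    gamma : a posteriori SNR, |Y|^2 divided by the noise PSD
    xi    : a priori SNR, speech PSD divided by the noise PSD
    q     : a priori speech absence probability, 0 < q < 1
    All names are hypothetical and chosen for illustration only.
    """
    gamma = np.asarray(gamma, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # Likelihood ratio of "speech absent" versus "speech present"
    # for a single TF unit, without the prior odds.
    likelihood_term = (1.0 + xi) * np.exp(-gamma * xi / (1.0 + xi))
    return 1.0 / (1.0 + (q / (1.0 - q)) * likelihood_term)

# A weak observation gives a low SPP, a strong observation a high SPP.
p_low = conventional_spp(gamma=0.0, xi=1.0, q=0.5)
p_high = conventional_spp(gamma=20.0, xi=1.0, q=0.5)
print(float(p_low), float(p_high))
```

Note that this sketch treats every TF unit independently, which is exactly the simplification that the inter-frame and inter-band correlations call into question.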
The SPP is commonly computed per TF unit in the short-time Fourier transform (STFT) domain based on the a priori speech absence probability and the a priori and a posteriori signal-to-noise ratios. Most a posteriori SPP estimators are derived under the assumption that the spectral coefficients of the speech and noise can be modeled using complex Gaussian random variables. Moreover, it is commonly assumed that the TF units are mutually uncorrelated across time and frequency. However, the spectral coefficients obtained after computing the STFT are correlated across both time and frequency. In addition, subsequent time frames are correlated due to the short-term stationarity of the speech signal, and neighboring frequency bins are correlated due to the harmonic structure of voiced speech segments [8].

† A joint institution of the University of Erlangen-Nuremberg and Fraunhofer IIS, Germany.
* Ms Momeni was a Visiting Researcher at the AudioLabs from September 2013 till February 2014.

In recent works [8–10], the inter-band correlations were explicitly used to derive novel noise reduction filters. In other works (cf. [9, 11, 12]), the inter-frame correlations have been used to derive novel single- and multichannel noise reduction filters. In [12], a single-channel noise reduction filter that uses the inter-frame correlations was derived that is able to reduce noise without distorting the desired speech. In [13], a fullband voice activity detector was proposed that takes the inter-band correlations into account. In [14], Gerkmann et al. noted that SPP estimators that rely on an observation of the noisy periodogram suffer from random fluctuations. Among other modifications, they proposed to compute an average of the a posteriori signal-to-noise ratio (SNR) under the assumption that the speech energy is distributed homogeneously over a small spectrogram region.
Although this approach partly accounts for the correlation that results from the spectral analysis, it does not account for the correlation due to the speech or noise signal.

In this paper, our goal is to estimate the narrowband SPP using a single noisy speech signal. In particular, we explicitly exploit the inter-frame and inter-band correlations when estimating the SPP in each TF unit. The obtained SPP estimator is similar to the one presented in [15] that was developed to exploit inter-channel correlations. In [15], a simplified SPP estimator was obtained under the implicit assumption that the correlation matrix of the desired signal is of rank one. Here, we investigate the performance of the SPP estimator under both the full-rank and the rank-one assumptions. The presented results demonstrate that we can increase the detection accuracy of the SPP estimator by taking a few neighboring time and frequency bins into account.

The paper is organized as follows: in Section 2, the problem is formulated. In Section 3, the SPP estimator is derived that is able to take both inter-frame and inter-band correlations into account. In Section 4, the experimental results are provided and discussed. Finally, Section 5 concludes the paper.

2. PROBLEM FORMULATION

We consider the well-accepted signal model in which a microphone captures a desired signal that is corrupted by additive noise. In the short-time Fourier transform (STFT) domain, we can express the spectral coefficients of the received signal at time frame m and discrete frequency k as

Y(k, m) = X(k, m) + V(k, m),    (1)

where X(k, m) is the desired signal and V(k, m) is the additive noise. We assume that the spectral coefficients X(k, m) and V(k, m) are uncorrelated and zero-mean complex Gaussian random variables.
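The additive model in (1) and the correlations introduced by the spectral analysis itself can be illustrated with a small numerical sketch. It assumes a Hann window of 256 samples with a hop of 64 samples (75% overlap), a sinusoid as a stand-in for the desired signal, and white Gaussian noise; none of these settings are taken from the paper, and the helper names `stft` and `coherence` are hypothetical.

```python
import numpy as np

def stft(x, frame_len=256, hop=64):
    """Hann-windowed STFT; returns an array of shape (num_bins, num_frames)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[m * hop:m * hop + frame_len] * window for m in range(num_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)

def coherence(a, b):
    """Magnitude of the normalized complex correlation of two sequences."""
    num = np.abs(np.mean(a * np.conj(b)))
    den = np.sqrt(np.mean(np.abs(a) ** 2) * np.mean(np.abs(b) ** 2))
    return num / den

rng = np.random.default_rng(0)
n = 64000
x = np.sin(2 * np.pi * 0.05 * np.arange(n))  # assumed stand-in for speech
v = rng.standard_normal(n)                   # assumed white noise

X, V, Y = stft(x), stft(v), stft(x + v)

# The STFT is linear, so Y(k, m) = X(k, m) + V(k, m) holds per TF unit.
additive = np.allclose(Y, X + V)

k = 40  # arbitrary frequency bin, chosen for illustration
# Adjacent frames share 75% of their samples, so even white noise
# yields correlated coefficients across time.
inter_frame = coherence(V[k, 1:], V[k, :-1])
# Frames four hops apart do not overlap at all.
far_frame = coherence(V[k, 4:], V[k, :-4])
# The window smears energy into neighboring bins, which correlates them.
inter_band = coherence(V[k, :], V[k + 1, :])

print(additive, inter_frame, far_frame, inter_band)
```

In this sketch the correlation stems only from the analysis window and the frame overlap; the additional correlation due to the speech signal itself is not modeled.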
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) 978-1-4799-2893-4/14/$31.00 ©2014 IEEE 2927

Because of the properties of the STFT and the nature of the speech signal, it is likely that the TF unit of interest is correlated with