SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION
USING INTER-FRAME AND INTER-BAND CORRELATIONS
Hajar Momeni (1,2,*), Emanuël A. P. Habets (1) and Hamid Reza Abutalebi (2)
(1) International Audio Laboratories Erlangen, Germany (†)
(2) Electrical and Computer Engineering Dept., Yazd University, Iran
h momeni@stu.yazd.ac.ir, emanuel.habets@audiolabs-erlangen.de and habutalebi@yazd.ac.ir
ABSTRACT
The speech presence probability (SPP) plays an important role in
many noise reduction and noise estimation methods. The SPP
is commonly computed per time and frequency in the short-time
Fourier transform (STFT) domain based on the a priori speech
absence probability and the a priori and a posteriori signal-to-
noise ratios. Due to the STFT as well as the nature of the speech
signal, there exists a correlation between subsequent time frames
and neighboring frequency bands. In this work, we explicitly take
these inter-frame and inter-band correlations into account when
computing the SPP. The presented results demonstrate that we can
increase the detection accuracy of the SPP estimator by taking a few
neighboring time and frequency bins into account.
Index Terms— speech presence probability, inter-frame corre-
lations, inter-band correlations.
1. INTRODUCTION
For many noise reduction and noise estimation methods, an estima-
tor for the speech presence probability (SPP) in each time-frequency
(TF) unit is of great interest. Clean-speech estimators, for exam-
ple, are often derived under the assumption that speech is actually
present. As this assumption is true neither during speech pauses nor
between spectral bins of the harmonics of a voiced sound, the SPP
should be taken into account [1–4]. Available noise power spectral
density (PSD) estimators also make use of the SPP to decide when
to update the noise PSD [5–7].
The SPP is commonly computed per TF unit in the short-time
Fourier transform (STFT) domain based on the a priori speech absence
probability and the a priori and a posteriori signal-to-noise
ratios. Most a posteriori SPP estimators are derived under the
assumption that the spectral coefficients of the speech and noise can
be modeled as complex Gaussian random variables. Moreover, it is
commonly assumed that the TF units are mutually uncorrelated across
time and frequency. In practice, however, the spectral coefficients
obtained after computing the STFT are correlated across both time and
frequency. In addition, subsequent time
frames are correlated due to the short-term stationarity of the speech
signal, and neighboring frequency bins are correlated due to the har-
monic structure of voiced speech segments [8].
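As a point of reference, the conventional per-bin estimator mentioned in this paragraph can be sketched as follows. This minimal Python sketch assumes the standard closed form that follows from the complex Gaussian model with mutually uncorrelated TF units. The function name `conventional_spp` and its argument names are illustrative and do not come from the paper.

```python
import numpy as np


def conventional_spp(gamma, xi, q):
    """Per-bin a posteriori SPP under the complex Gaussian model.

    gamma : a posteriori SNR, |Y|^2 divided by the noise PSD (linear scale)
    xi    : a priori SNR (linear scale)
    q     : a priori speech absence probability, 0 < q < 1
    """
    gamma = np.asarray(gamma, dtype=float)
    xi = np.asarray(xi, dtype=float)
    # Exponent of the likelihood ratio between speech presence and absence.
    v = gamma * xi / (1.0 + xi)
    # Inverse generalized likelihood ratio, weighted by the prior odds.
    inv_ratio = q / (1.0 - q) * (1.0 + xi) * np.exp(-v)
    return 1.0 / (1.0 + inv_ratio)
```

For example, with q = 0.5 and an a priori SNR of 10 (linear), an a posteriori SNR of 0 yields an SPP of 1/12, whereas an a posteriori SNR of 20 yields an SPP close to one. Each TF unit is treated in isolation here, which is exactly the assumption that the remainder of the paper relaxes.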
† A joint institution of the University of Erlangen-Nuremberg and Fraunhofer IIS, Germany.
* Ms Momeni was a Visiting Researcher at the AudioLabs from September 2013 till February 2014.
In recent works [8–10], the inter-band correlations were explicitly
used to derive novel noise reduction filters. In other works
(cf. [9, 11, 12]), the inter-frame correlations have been used to
derive novel single- and multichannel noise reduction filters. In [12],
a single-channel noise reduction filter was derived that uses the
inter-frame correlations and is able to reduce noise without distorting
the desired speech. In [13], a fullband voice activity detector was
proposed that takes the inter-band correlations into account. In [14],
Gerkmann et al. noted that SPP estimators that rely on an obser-
vation of the noisy periodogram suffer from random fluctuations.
Among other modifications, they proposed to compute an average
of the a posteriori signal-to-noise ratio (SNR) under the assumption
that the speech energy is distributed homogeneously over a small
spectrogram region. Although this partly accounts for the correlation
that results from the spectral analysis, the correlation inherent in
the speech or noise signal is not taken into account.
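The averaging idea described above can be illustrated with a short sketch. The code below is not the estimator of [14]; it only shows, under the assumption of a rectangular neighborhood with edge replication at the borders, how averaging the a posteriori SNR over a small spectrogram region reduces its random fluctuations. The function name `local_mean_posterior_snr` and the parameters `half_k` and `half_m` are hypothetical.

```python
import numpy as np


def local_mean_posterior_snr(gamma, half_k=1, half_m=1):
    """Average the a posteriori SNR over a small TF neighborhood.

    gamma  : array of shape (K, M), a posteriori SNR per frequency bin and frame
    half_k : number of neighboring bins on each side (assumed value)
    half_m : number of neighboring frames on each side (assumed value)
    """
    gamma = np.asarray(gamma, dtype=float)
    n_bins, n_frames = gamma.shape
    # Replicate the border values so that the output keeps the input shape.
    padded = np.pad(gamma, ((half_k, half_k), (half_m, half_m)), mode="edge")
    acc = np.zeros_like(gamma)
    # Sum all shifted copies of the spectrogram inside the neighborhood.
    for dk in range(2 * half_k + 1):
        for dm in range(2 * half_m + 1):
            acc += padded[dk:dk + n_bins, dm:dm + n_frames]
    return acc / ((2 * half_k + 1) * (2 * half_m + 1))
```

In noise-only regions the per-bin a posteriori SNR is exponentially distributed under the Gaussian model, so the local mean has a visibly smaller variance than the raw values. The sketch ignores any correlation between the averaged bins.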
In this paper, our goal is to estimate the narrowband SPP us-
ing a single noisy speech signal. In particular, we explicitly exploit
the inter-frame and inter-band correlations when estimating the SPP
in each TF unit. The obtained SPP estimator is similar to the one
presented in [15] that was developed to exploit inter-channel
correlations. In [15], a simplified SPP estimator was obtained under the
implicit assumption that the correlation matrix of the desired signal
is of rank one. Here, we investigate the performance of the SPP
estimator under both full-rank and rank-one assumptions. The presented
results demonstrate that we can increase the detection accuracy of
the SPP estimator by taking a few neighboring time and frequency
bins into account.
The paper is organized as follows. In Section 2, the problem is
formulated. In Section 3, an SPP estimator is derived that takes
both inter-frame and inter-band correlations into account. In
Section 4, the experimental results are provided and discussed.
Finally, Section 5 concludes the paper.
2. PROBLEM FORMULATION
We consider the well-accepted signal model in which a microphone
captures a desired signal that is corrupted by additive noise. In
the short-time Fourier transform (STFT) domain we can express
the spectral coefficients of the received signal at time-frame m and
discrete-frequency k as
    Y(k, m) = X(k, m) + V(k, m),    (1)
where X(k, m) is the desired signal and V(k, m) is the additive
noise. We assume that the spectral coefficients X(k, m) and
V(k, m) are uncorrelated, zero-mean complex Gaussian random
variables.
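The additive model in (1), and the correlation that the STFT itself introduces, can be checked numerically. The sketch below is only an illustration under stated assumptions: a 512-sample Hann window with 75% overlap, a synthetic harmonic tone as a stand-in for speech, and white Gaussian noise. None of these settings are taken from the paper, and the helper names `stft` and `corr_coeff` are hypothetical.

```python
import numpy as np


def stft(x, frame_len=512, hop=128):
    """Hann-windowed STFT; returns an array of shape (bins K, frames M)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)


def corr_coeff(a, b, axis):
    """Magnitude of the normalized complex correlation along one axis."""
    num = np.abs(np.mean(a * np.conj(b), axis=axis))
    den = np.sqrt(np.mean(np.abs(a) ** 2, axis=axis)
                  * np.mean(np.abs(b) ** 2, axis=axis))
    return num / den


rng = np.random.default_rng(0)
fs, n = 16000, 64000                      # assumed sampling rate and length
t = np.arange(n) / fs
# Synthetic stand-in for voiced speech: five harmonics of 150 Hz.
x = sum(np.sin(2 * np.pi * 150 * h * t) / h for h in range(1, 6))
v = rng.standard_normal(n)                # white Gaussian noise

X, V, Y = stft(x), stft(v), stft(x + v)
# The STFT is linear, so the time-domain sum carries over to (1).
model_holds = np.allclose(Y, X + V)

# Correlation of the noise coefficients, estimated over frames and
# averaged over frequency; the noise itself is white in time.
inter_frame = corr_coeff(V[:, :-1], V[:, 1:], axis=1).mean()   # adjacent frames
inter_band = corr_coeff(V[:-1, :], V[1:, :], axis=1).mean()    # adjacent bins
far_frame = corr_coeff(V[:, :-4], V[:, 4:], axis=1).mean()     # no overlap

print(model_holds, inter_frame, inter_band, far_frame)
```

Even for white noise, adjacent frames and adjacent bins show a clearly non-zero correlation with these settings, because the frames overlap and the analysis window spreads energy over neighboring bins. Frames that no longer overlap are nearly uncorrelated.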
Because of the properties of the STFT and the nature of the
speech signal, it is likely that the TF unit of interest is correlated with
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP)
978-1-4799-2893-4/14/$31.00 ©2014 IEEE 2927