International Journal of Speech Technology
https://doi.org/10.1007/s10772-018-9506-9
Speech enhancement by combining spectral subtraction
and minimum mean square error‑spectrum power estimator based
on zero crossing
Thimmaraja G. Yadava¹ · H. S. Jayanna²
Received: 16 August 2017 / Accepted: 29 March 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Speech data collected in uncontrolled environments needs to be processed before it can be used to build a robust
automatic speech recognition system. In this paper, a method is proposed to process the degraded speech signal.
Initially, the significance of spectral subtraction with voice activity detection (SS-VAD) and of magnitude squared
spectrum (MSS) estimators is studied for different types of noise. In the SS-VAD method, the degraded speech data is
sampled and windowed into frames with 50% overlap. The VAD is used to detect the voiced regions of the speech signal.
The minimum mean square error-short time power spectrum, minimum mean square error-spectrum power based on zero
crossing (MMSE-SPZC) and maximum a posteriori estimators are studied individually. These MSS estimators are
implemented on the assumption that the magnitude squared spectrum of the degraded speech signal is the sum of the
magnitude squared spectra of the clean (original) speech signal and the noise model. The experimental results show
that the MMSE-SPZC estimator performs better than the other two estimators. This estimator is combined with the
SS-VAD method to improve the performance. The combined SS-VAD and MMSE-SPZC method yields better speech quality than
the individual methods by reducing the noise in the degraded speech signal.
Keywords Automatic speech recognition (ASR) · Spectral subtraction (SS) · Voice activity detection (VAD) · Magnitude
squared spectrum (MSS) · Speech data
1 Introduction
Speech enhancement mainly depends on human perceptual
factors and signal processing applications. Speech data
collected in real-time environments is noisy in nature.
Speech is normally corrupted by several degradations, such
as background noise, vocal noise, factory noise, F16 noise,
babble noise and reverberation. Reducing the noise
in degraded speech data is a challenging task (Rabiner and
Juang 1993; Loizou 2007). The spectral subtraction (SS)
method is commonly used for speech enhancement and is
mainly associated with voice activity detection (VAD).
VAD is used to find the active regions of the degraded
speech signal (Ramirez et al. 2003). The corrupted speech
signal is modeled as the sum of the clean (original) speech
signal and additive noise.
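The following is a minimal sketch of how magnitude spectral subtraction with a VAD could be put together under the additive model, written as y(n) = x(n) + d(n) for the degraded signal, clean speech and noise (these symbols are introduced here for illustration). The 20 ms frames and 50% overlap follow the paper. Everything else is an assumption made for this example and not necessarily the authors' configuration: the function names `frame_signal` and `spectral_subtraction`, the parameters `fs`, `vad_ratio` and `init_frames`, the 8 kHz sampling rate, the periodic Hann window, the simple energy-threshold VAD, the premise that the first few frames contain noise only, and the half-wave rectification of negative magnitudes.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames; trailing samples that do not fill a frame are dropped."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]


def spectral_subtraction(noisy, fs=8000, frame_ms=20, vad_ratio=2.0, init_frames=5):
    """Illustrative SS-VAD sketch (hypothetical helper, not the paper's exact algorithm).

    Returns the enhanced signal and a per-frame boolean speech/non-speech decision.
    """
    noisy = np.asarray(noisy, dtype=float)
    frame_len = int(fs * frame_ms / 1000)   # 20 ms frames, as in the paper
    hop = frame_len // 2                    # 50% overlap, as in the paper

    # Periodic Hann window: copies shifted by half a frame sum to one,
    # so plain overlap-add reconstructs the interior of the signal.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)

    frames = frame_signal(noisy, frame_len, hop) * window
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Assumed VAD: a frame is speech if its energy exceeds vad_ratio times
    # the mean energy of the first init_frames frames (assumed noise only).
    energy = np.sum(frames ** 2, axis=1)
    noise_energy = np.mean(energy[:init_frames])
    is_speech = energy > vad_ratio * noise_energy
    is_speech[:init_frames] = False

    # Average noise magnitude spectrum over the non-speech frames.
    noise_mag = mag[~is_speech].mean(axis=0)

    # Subtract the noise magnitude and clip negative values to zero
    # (half-wave rectification); the noisy phase is reused unchanged.
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)

    # Overlap-add the enhanced frames back into a single signal.
    out = np.zeros(hop * (len(clean_frames) - 1) + frame_len)
    for i, frame in enumerate(clean_frames):
        out[i * hop:i * hop + frame_len] += frame
    return out, is_speech
```

Calling `spectral_subtraction(noisy, fs=8000)` on a one-dimensional array returns the enhanced samples together with the frame-level VAD decisions. The energy threshold is only a placeholder for a real VAD; in practice both the detector and the rule for handling negative magnitudes strongly influence the residual noise.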
The degraded speech signal is processed frame by
frame, with a frame duration of 20 ms. The SS-VAD method was
proposed for speech enhancement in Boll (1979), Kamath
and Loizou (2002), Jounghoon and Hanseok (2003), Cole
et al. (2008) and Goodarzi and Seyedtabaii (2009). The
effect of noise in the degraded speech signal can be suppressed
by subtracting the average magnitude spectrum of the noise
model from the average magnitude spectrum of the degraded
speech signal. The process of applying several noise
elimination techniques for speech enhancement is called speech
preprocessing (Loizou 2007). A modified SS algorithm
was proposed for speech enhancement in Bing et al.
(2009). This algorithm was implemented using VAD
and minima controlled recursive averaging (Cohen and
Berdugo 2002). The experimental results were evaluated
under the ITU-T G.160 standard and compared with existing
methods. In Huanhuan et al. (2012), an improved SS
* Thimmaraja G. Yadava
thimrajyadav@gmail.com
¹ Department of Electronics and Communication Engineering,
Siddaganga Institute of Technology, Tumkur, Karnataka,
India
² Department of Information Science and Engineering,
Siddaganga Institute of Technology, Tumkur, Karnataka,
India