International Journal of Speech Technology
https://doi.org/10.1007/s10772-018-9506-9

Speech enhancement by combining spectral subtraction and minimum mean square error-spectrum power estimator based on zero crossing

Thimmaraja G. Yadava 1 · H. S. Jayanna 2

Received: 16 August 2017 / Accepted: 29 March 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018

Abstract
Speech data collected in uncontrolled environments must be processed to build a robust automatic speech recognition system. In this paper, a method is proposed to process the degraded speech signal. Initially, the significance of spectral subtraction with voice activity detection (SS-VAD) and of magnitude squared spectrum (MSS) estimators is studied for different types of noise. In the SS-VAD method, the degraded speech data is sampled and windowed into frames with 50% overlap. The VAD is used to detect the voiced regions of the speech signal. The minimum mean square error-short time power spectrum, minimum mean square error-spectrum power based on zero crossing (MMSE-SPZC) and maximum a posteriori estimators are studied individually. These MSS estimators are implemented on the assumption that the magnitude squared spectrum of the degraded speech signal is the sum of those of the clean (original) speech signal and the noise model. The experimental results show that the MMSE-SPZC estimator performs better than the other two estimators. This estimator is therefore combined with the SS-VAD method to improve the performance further. The combined SS-VAD and MMSE-SPZC method yields better speech quality, by reducing the noise in the degraded speech signal, than either method used individually.

Keywords Automatic speech recognition (ASR) · Spectral subtraction (SS) voice activity detection (VAD) · Magnitude squared spectrum (MSS) · Speech data

1 Introduction
Speech enhancement mainly depends on human perceptual factors and signal processing applications.
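The framing and voice activity detection steps that the abstract describes for SS-VAD can be illustrated with a minimal sketch. It is not the authors' implementation: the periodic Hann window, the 8 kHz sampling rate, the short-time-energy detector, the threshold factor and the names `frame_signal` and `energy_vad` are all assumptions made for illustration, and the energy threshold is only a simple stand-in for the VAD used in the paper.

```python
import math
import random


def frame_signal(samples, frame_len):
    """Split samples into windowed frames with 50% overlap.

    Assumption: a periodic Hann window; the source does not name the window.
    frame_len is expected to be even so that the hop is exactly half a frame.
    """
    hop = frame_len // 2  # 50% overlap between consecutive frames
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


def energy_vad(frames, noise_frames=3, factor=3.0):
    """Flag frames as active when their energy exceeds a noise-based threshold.

    Assumption: the first `noise_frames` frames contain noise only, and
    `factor` times their mean energy is used as the decision threshold.
    """
    energies = [sum(x * x for x in frame) for frame in frames]
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    threshold = factor * noise_floor
    return [energy > threshold for energy in energies]


# Synthetic test signal (hypothetical): weak noise with a 200 Hz tone burst
# between samples 800 and 1599, at an assumed sampling rate of 8 kHz.
fs = 8000
rng = random.Random(0)
signal = [
    rng.uniform(-0.01, 0.01)
    + (0.5 * math.sin(2 * math.pi * 200 * n / fs) if 800 <= n < 1600 else 0.0)
    for n in range(2400)
]

frames = frame_signal(signal, frame_len=160)  # 160 samples = 20 ms at 8 kHz
flags = energy_vad(frames)
```

Here `flags[i]` is `True` for frames that the detector treats as active. In a spectral subtraction pipeline, the frames flagged as inactive would typically be the ones used to update the noise estimate.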
Speech data collected in real-time environments is noisy in nature. Speech is normally corrupted by degradations such as background noise, vocal noise, factory noise, F16 noise, babble noise and reverberation. Noise reduction in degraded speech data is a challenging task (Rabiner and Juang 1993; Loizou 2007). The spectral subtraction (SS) method is commonly used for speech enhancement and is mainly associated with voice activity detection (VAD). VAD is used to find the active regions of the degraded speech signal (Ramirez et al. 2003). The corrupted speech signal is modeled as the sum of the clean (original) speech signal and an additive noise model. The degraded speech segments are processed frame by frame, with a frame duration of 20 ms. The SS-VAD method was proposed for speech enhancement in Boll (1979), Kamath and Loizou (2002), Jounghoon and Hanseok (2003), Cole et al. (2008) and Goodarzi and Seyedtabaii (2009). The effect of noise in the degraded speech signal can be eliminated by subtracting the average magnitude spectrum of the noise model from the average magnitude spectrum of the degraded speech signal. The process of applying several noise elimination techniques for speech enhancement is called speech preprocessing (Loizou 2007).

* Thimmaraja G. Yadava
thimrajyadav@gmail.com
1 Department of Electronics and Communication Engineering, Siddaganga Institute of Technology, Tumkur, Karnataka, India
2 Department of Information Science and Engineering, Siddaganga Institute of Technology, Tumkur, Karnataka, India

A modified SS algorithm was proposed for speech enhancement in Bing et al. (2009). This algorithm was implemented using VAD and minima controlled recursive averaging (Cohen and Berdugo 2002). The experimental results were evaluated under the ITU-T G.160 standard and compared with existing methods. In Huanhuan et al. (2012), an improved SS