I.J. Image, Graphics and Signal Processing, 2013, 11, 13-22
Published Online September 2013 in MECS (http://www.mecs-press.org/)
DOI: 10.5815/ijigsp.2013.11.02
Copyright © 2013 MECS
Spectral Subtractive-Type Algorithms for
Enhancement of Noisy Speech: An Integrative
Review
Navneet Upadhyay¹, Abhijit Karmakar²
¹Department of Electrical & Electronics Engineering, Birla Institute of Technology and Science, Pilani 333031, India
²Integrated Circuit Design Group, CSIR - Central Electronics Engineering Research Institute, Pilani 333031, India
e-mail: navneet_upd@rediffmail.com¹, abhijit@ceeri.ernet.in²
Abstract — The spectral subtraction method is a classical approach for the enhancement of speech degraded by additive background noise. The basic principle of this method is to estimate the short-time spectral magnitude of speech by subtracting the estimated noise spectrum from the noisy speech spectrum. Equivalently, this can be achieved by multiplying the noisy speech spectrum by a gain function and later combining the result with the phase of the noisy speech. Besides reducing the background noise, this method introduces an annoying perceptible tonal characteristic into the enhanced speech, known as remnant musical noise, which degrades human listening. Several variations and implementations of this method have been proposed over the past decades to address the limitations of the basic spectral subtraction method. These variations constitute a family of subtractive-type algorithms and operate in the frequency domain. The objective of this paper is to provide an extensive overview of spectral subtractive-type algorithms for the enhancement of noisy speech. After the review, the paper concludes by outlining future directions of speech enhancement research from the spectral subtraction perspective.
Index Terms — Speech enhancement, additive
background noise, noise estimation, spectral subtractive-
type algorithms, remnant musical noise
I. INTRODUCTION
Speech is one of the most prominent and primary modes of human-to-human and human-to-machine communication in various fields, for instance, automatic speech recognition and speaker identification [1]. Present-day speech communication systems are severely degraded by various types of unwanted random sounds, which make the listening task difficult for a direct listener and cause inaccurate transfer of information [2]. Therefore, speech enhancement has been one of the main motives of research endeavors in the field of speech processing over the past few decades. The main objective of speech enhancement is to minimize the degree of distortion of the desired speech signal and to improve one or more perceptual aspects of speech, such as its quality and/or intelligibility.
The quality of speech is a subjective measure which
reflects the way that the signal is perceived by listeners.
Intelligibility, on the other hand, is an objective measure of the amount of information that listeners can extract from the speech signal. These two measures are independent of each other: a speech signal may be of high quality yet low intelligibility, and vice versa [1-4].
Speech enhancement methods are classified, according to the number of microphones used for collecting the speech data, into single-, dual-, or multi-channel methods. Although multi-channel speech enhancement performs better than single-channel speech enhancement [1-2], single-channel speech enhancement remains a significant field of research because of its simple implementation and ease of computation. Single-channel speech enhancement uses only one microphone to collect the noisy speech data [1-4].
Estimating the spectral amplitude from the noisy data is easier than estimating both the amplitude and the phase. It was revealed in [5-6] that the short-time spectral amplitude (STSA) is more important than the phase information for the quality and intelligibility of speech. Therefore, single-channel speech enhancement methods are usually divided into two classes based on STSA estimation. The first class applies subtractive-type algorithms, which attempt to estimate the short-time spectral magnitude (STSM) of speech by subtracting the estimated noise spectrum; here, the noise is estimated during speech pauses [7-11, 13]. The other class applies a spectral subtraction filter (SSF) to the noisy speech to obtain the spectral amplitude of the enhanced speech. The design principle is to select appropriate filter parameters so as to minimize the difference between the enhanced speech and the clean speech signal [8].
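The subtractive-type class described above can be sketched in a few lines. The following is a minimal, hypothetical single-frame illustration in pure Python; the naive DFT helpers, the frame length, and the half-wave rectification to zero are assumptions of this sketch, not taken from any specific algorithm in the literature:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (for illustration only)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of the time-domain signal."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_subtraction(noisy_frame, noise_mag_est):
    """Basic magnitude spectral subtraction for one analysis frame.

    noise_mag_est is the noise magnitude spectrum, assumed to have been
    estimated during speech pauses.
    """
    Y = dft(noisy_frame)
    # Subtract the estimated noise magnitude; half-wave rectify negative
    # results to zero (this rectification is one source of musical noise).
    S_mag = [max(abs(Yk) - Dk, 0.0) for Yk, Dk in zip(Y, noise_mag_est)]
    # Recombine the modified magnitude with the noisy phase, then invert.
    S = [m * cmath.exp(1j * cmath.phase(Yk)) for m, Yk in zip(S_mag, Y)]
    return idft(S)
```

Equivalently, the subtraction can be expressed as a gain function H(k) = max(1 - D(k)/|Y(k)|, 0) multiplied with the noisy spectrum, which is the filtering view taken by the SSF class.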
In real-world listening environments, speech is mostly degraded by additive noise [5, 9-14]. Additive noise is typically background noise that is uncorrelated with the clean speech signal, such as white Gaussian noise (WGN), colored noise, or multi-talker (babble) noise. The background noise may be stationary or non-stationary in nature. Therefore, the